### <span style = 'color:green'> Capstone Project </span>
#### <span style = 'color:blue'> Problem statement: Perform sentiment analysis on the Omicron variant, using data fetched directly from Twitter</span>
**Sentiment analysis is the process of identifying the feelings and emotions expressed in text, using machine learning or AI.**

**Project Pipeline**

The steps for completing the project are:

- **Import Necessary Dependencies**
- **Read and Load the Dataset**
- **Exploratory Data Analysis**
- **Data Visualization of Target Variables**
- **Data Preprocessing**
- **Splitting our data into Train and Test Subset**
- **Transforming Dataset using TF-IDF Vectorizer**
- **Function for Model Evaluation**
- **Model Building**
- **Conclusion**

- The dataset has to be fetched directly from Twitter in real time

- Objective: perform real-time sentiment analysis on the data collected from Twitter

                      

           

### <span style = 'color:blue'>   API (Application Programming Interface)</span>
- Imagine you’re sitting at a table in a restaurant with a menu of choices to order from. The kitchen is the part of the “system” that will prepare your order. What is missing is the critical link to communicate your order to the kitchen and deliver your food back to your table. That’s where the waiter or API comes in. The waiter is the messenger – or API – that takes your request or order and tells the kitchen – the system – what to do. Then the waiter delivers the response back to you; in this case, it is the food.
- APIs are used almost everywhere in modern software
- In simple words, an API acts as a bridge that lets one program access data held by another system
- The Twitter platform offers many APIs that software developers can work with, up to building fully automated systems that interact with Twitter. Companies can benefit from this by drawing insights from Twitter data

   **Some of the things the Twitter API gives access to:**
- Tweets: searching, posting, filtering, engagement, streaming etc.
- Accounts and users (Beta): account management, user interactions.
- Media: uploading and accessing photos, videos and animated GIFs.
- Trends: trending topics in a given location.
- Geo: information about known places or places near a location.

**Getting twitter API keys**
- If you don't already have a developer account, you can log in with your normal Twitter credentials


- Follow the required prompts to create a developer project: <a href="https://dev.twitter.com/apps" title="Twitter">Click here</a>
- Requesting the API key and secret via the Developer Portal causes Twitter to produce the following three things:
1. API key (this is your 'consumer key')
2. API secret key (this is your 'consumer secret')
3. Bearer token
- Next, visit the 'Authentication Tokens' area of the Developer Portal and generate an 'Access token & secret'. This will provide you with the following two items:
1. Access token (this is your 'token key')
2. Access token secret (this is your 'token secret')
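
Once the keys exist, one option is to keep them out of the notebook entirely. The sketch below reads the four values from environment variables using only the standard library. The variable names (`TWITTER_API_KEY` and so on) and the helper `load_twitter_credentials` are assumptions made for illustration; they are not defined by Twitter or Tweepy.

```python
import os

# Hypothetical environment variable names; choose whatever fits your setup.
ENV_NAMES = {
    "consumer_key": "TWITTER_API_KEY",
    "consumer_secret": "TWITTER_API_SECRET",
    "access_token": "TWITTER_ACCESS_TOKEN",
    "access_token_secret": "TWITTER_ACCESS_TOKEN_SECRET",
}


def load_twitter_credentials(environ=os.environ):
    """Return the four credentials as a dict, or raise if any is missing or empty."""
    missing = [env for env in ENV_NAMES.values() if not environ.get(env)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[env] for name, env in ENV_NAMES.items()}
```

The returned values can then be passed to `tweepy.OAuthHandler` and `set_access_token` in place of hard-coded strings.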


**Expected output**
- The data fetched from Twitter should undergo EDA: analysis, cleaning, handling, manipulation, visualization, etc.
- The final output should show the sentiment of the data


  *Some tips to consider*

- Machines can learn in many ways, so it's always worth thinking outside the box

- Perform EDA as broadly as possible and keep doing it throughout the project

- Try different models to see how each one differs from the others

- Do not include unnecessary code or needless algorithms, which only add complexity

- Approaching the problem statement in several different ways helps us find the best one

- Models that are as simple as possible are easier to understand and modify

- With multiple models, we can compare their judgements and efficiencies

- Tuning helps increase accuracy :)

- Keep track of how long each model takes; a model that runs in reasonable time is preferable

- Spend a good amount of time analyzing the dataset and draw as many insights as possible

- Tweepy is an important library that we will use to fetch data from Twitter through the API


For more, see the Tweepy documentation: <a href="https://docs.tweepy.org/en/stable/getting_started.html#hello-tweepy" title="Tweepy">Click here</a>

In [None]:
# importing the libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import tweepy
import re
import time
import unicodedata
import logging
import string
import warnings
warnings.filterwarnings('ignore')
from credentials import * # pip install --upgrade credentials

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.svm import SVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder,LabelBinarizer

In [None]:
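import nltk # pip install nltk --upgrade
nltk.download('vader_lexicon')
nltk.download('stopwords')
nltk.download('wordnet')  # needed by WordNetLemmatizer
nltk.download('punkt')    # needed by word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)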
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from wordcloud import WordCloud,STOPWORDS
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize,sent_tokenize
from nltk.stem import LancasterStemmer
from bs4 import BeautifulSoup
from nltk.tokenize.toktok import ToktokTokenizer
from textblob import TextBlob,Word

In [None]:
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

In [None]:
tokenizer = ToktokTokenizer()
stopwords_list = nltk.corpus.stopwords.words('english')

In [None]:
#Assigning the keys
#consumer_key = ''
#consumer_secret = ''
#access_token = ''
#access_token_secret = ''

In [None]:
#auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
#auth.set_access_token(access_token, access_token_secret)
#api = tweepy.API(auth, wait_on_rate_limit=True)

In [None]:
#search_query = "'ref''omicron'-filter:retweets AND -filter:replies AND -filter:links"
#no_of_tweets = 100

#try:
#    # Number of tweets to retrieve from the search
#    tweets = api.search_tweets(q=search_query, lang="en", count=no_of_tweets, tweet_mode='extended')
#
#    # Pull some attributes from each tweet
#    attributes_container = [[tweet.full_text, tweet.favorite_count, tweet.created_at, tweet.retweet_count] for tweet in tweets]
#
#    # Column names for the dataframe
#    columns = ["tweets", "likes", "time", "retweet_count"]
#
#    # Create the dataframe
#    tweets_df = pd.DataFrame(attributes_container, columns=columns)
#except Exception as e:
#    print('Status Failed On,', str(e))

In [None]:
#tweets_df.to_csv('tweets.csv')

In [None]:
#df = pd.read_csv('tweets.csv')

In [None]:
df = pd.read_csv('Omicron_data.csv') # Using the csv file provided by Skillovilla

Data Preprocessing

In [None]:
def preprocess_text(text):
    text = re.sub(r"http\S+|www\S+|https\S+", '', text)  # Remove URLs
    text = re.sub(r'\S+\.com\S+', '', text)  # Remove remaining .com links
    text = re.sub(r'#\w+', '', text)  # Remove hashtags
    text = re.sub(r'@\w+', '', text)  # Remove mentions
    text = re.sub('[^a-zA-Z]', ' ', text)  # Replace non-alphabetic characters with spaces
    text = re.sub(r'\s+', ' ', text).strip()  # Collapse the extra whitespace left by the steps above
    text = text.lower()  # Convert text to lowercase
    return text

df['tweets'] = df['tweets'].apply(preprocess_text)
df.head()

Stemming

In [None]:
def simpleStemmer(text):
    ps = nltk.porter.PorterStemmer()
    text = ' '.join([ps.stem(word) for word in text.split()])
    return text

df['stem_tweets'] = df['tweets'].apply(simpleStemmer)
df.head()

Lemmatization

In [None]:
def simpleLemmatization(text):
    lemma = nltk.stem.WordNetLemmatizer()
    text = ' '.join([lemma.lemmatize(word) for word in text.split()])
    return text

df['lemm_tweets'] = df['tweets'].apply(simpleLemmatization)
df.head()

Removing StopWords

In [None]:
# Use the stopwords_list defined earlier; a set makes membership checks fast
# and avoids rebinding the imported `stopwords` module name
stopword_set = set(stopwords_list)

def remove_stopwords(text,is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_set]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_set]
    filtered_text = ' '.join(filtered_tokens)
    return filtered_text

df['final_tweets'] = df['tweets'].apply(remove_stopwords)
df.head()

In [None]:
sent_df = df.copy()
sent_df["sentiment_score"] = ''
sent_df["Negative"] = ''
sent_df["Neutral"] = ''
sent_df["Positive"] = ''
sent_df.drop('tweets', axis=1, inplace=True)
sent_df.head()

In [None]:
# Create the analyzer once; building it for every tweet reloads the lexicon each time
sentiment_analyzer = SentimentIntensityAnalyzer()

def analyze_sentiment(tweet):
    """Return the VADER polarity scores for one tweet, or None if the input is not text."""
    try:
        sentence_text = unicodedata.normalize('NFKD', tweet)
        return sentiment_analyzer.polarity_scores(sentence_text)
    except TypeError as e:
        logging.error(f"Error analyzing sentiment for text '{tweet}': {e}")
        return None

# Score every tweet once and reuse the result for all four columns
scores = sent_df['final_tweets'].apply(analyze_sentiment)
sent_df['sentiment_score'] = scores.apply(lambda s: s['compound'] if s else None)
sent_df['Negative'] = scores.apply(lambda s: s['neg'] if s else None)
sent_df['Neutral'] = scores.apply(lambda s: s['neu'] if s else None)
sent_df['Positive'] = scores.apply(lambda s: s['pos'] if s else None)

sent_df.head()

In [None]:
# Boolean masks avoid relying on the row labels being 0..n-1
positive_tweets = sent_df.loc[sent_df['sentiment_score'] > 0, 'final_tweets'].tolist()
neutral_tweets = sent_df.loc[sent_df['sentiment_score'] == 0, 'final_tweets'].tolist()
negative_tweets = sent_df.loc[sent_df['sentiment_score'] < 0, 'final_tweets'].tolist()

In [None]:
total_tweets = len(sent_df['final_tweets'])
print("Percentage of positive tweets: {}%".format(len(positive_tweets)*100/total_tweets))
print("Percentage of neutral tweets: {}%".format(len(neutral_tweets)*100/total_tweets))
print("Percentage of negative tweets: {}%".format(len(negative_tweets)*100/total_tweets))

Analysis:

According to our analysis of the tweets collected from Twitter, the sentiment of people toward the Omicron variant is:

Approx. 39% of tweets are positive
Approx. 23% of tweets are neutral
Approx. 38% of tweets are negative

In [None]:
# Dropping the index column (the first column, carried over from the CSV)
sent_df.drop(columns=sent_df.columns[0], inplace=True)

In [None]:
# Split the tweet timestamp into day, month and year columns (pandas handles the parsing)

sent_df['datetime'] = pd.to_datetime(sent_df['time'])
sent_df['day'] = sent_df['datetime'].dt.day
sent_df['month'] = sent_df['datetime'].dt.month
sent_df['year'] = sent_df['datetime'].dt.year

sent_df.drop('time', axis=1, inplace=True)

sent_df.head()

EDA

In [None]:
import dataprep
from dataprep.eda import create_report

In [None]:
create_report(df)

Normalised train and test reviews

In [None]:
sent_df.shape

In [None]:
norm_train_reviews = sent_df.final_tweets[:4900]
print(norm_train_reviews[1])

norm_test_reviews = sent_df.final_tweets[4900:]
print(norm_test_reviews[6997])

Bag of Words
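
To make the idea concrete, here is a minimal standard-library sketch of a bag-of-words matrix over word n-grams of length 1 to 3, the same range the notebook passes to `CountVectorizer`. The two tweets are made up for illustration, the helper `ngrams` is hypothetical, and the sketch does no frequency filtering (nothing like `min_df`).

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Yield every word n-gram of length n_min..n_max as a space-joined string."""
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])


# Two made-up, already-cleaned tweets (illustrative only)
docs = ["omicron cases rising", "omicron cases mild"]

# One Counter per document: n-gram -> number of occurrences
bags = [Counter(ngrams(doc.split())) for doc in docs]

# Vocabulary is the sorted union of all n-grams; each row counts them per document
vocab = sorted(set().union(*bags))
matrix = [[bag[term] for term in vocab] for bag in bags]

for term, column in zip(vocab, zip(*matrix)):
    print(f"{term!r}: {column}")
```

Each row of `matrix` is one tweet and each column is one n-gram, which is the shape printed for `cv_train_reviews` in the notebook.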

In [None]:
norm_train_reviews = norm_train_reviews.dropna()
norm_test_reviews = norm_test_reviews.dropna()

In [None]:
cv = CountVectorizer(min_df=0.01,max_df=1.0,binary=False,ngram_range=(1,3))
cv_train_reviews = cv.fit_transform(norm_train_reviews)
cv_test_reviews = cv.transform(norm_test_reviews)

print('BOW_cv_train:',cv_train_reviews.shape)
print('BOW_cv_test:',cv_test_reviews.shape)

TF-IDF Vectorizer; Term frequency and inverse document frequency
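
As a worked illustration, the sketch below computes TF-IDF weights by hand with the standard library. It assumes the smoothed formula that scikit-learn documents for its defaults, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalisation of each row. The documents and the helper names are made up.

```python
import math
from collections import Counter

# Two made-up, tokenised tweets (illustrative only)
docs = [["omicron", "cases", "rising"], ["omicron", "mild"]]
n_docs = len(docs)
vocab = sorted({term for doc in docs for term in doc})

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for doc in docs for term in set(doc))


def idf(term):
    """Smoothed inverse document frequency (assumed scikit-learn default)."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf_row(doc):
    """Raw term counts times idf, scaled so the row has unit L2 norm."""
    counts = Counter(doc)
    raw = [counts[term] * idf(term) for term in vocab]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw] if norm else raw


for doc in docs:
    print([round(value, 3) for value in tfidf_row(doc)])
```

A term that appears in every tweet ("omicron" here) gets the minimum idf of 1, so rarer terms such as "mild" end up with a larger weight in the same row.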

In [None]:
tv = TfidfVectorizer(min_df=0.01,max_df=1.0,use_idf=True,ngram_range=(1,3))
tv_train_reviews = tv.fit_transform(norm_train_reviews)
tv_test_reviews=tv.transform(norm_test_reviews)


print('Tfidf_train:',tv_train_reviews.shape)
print('Tfidf_test:',tv_test_reviews.shape)

Plotting graph to visualize the most common words in tweets.

In [None]:
df['tokenize'] = df['final_tweets'].apply(word_tokenize)
df.head()

In [None]:
import itertools,collections

new_tokenize = df['tokenize']
all_words = list(itertools.chain(*new_tokenize))

In [None]:
counts = collections.Counter(all_words)
count_frequency = counts.most_common(200)

clean_tweets = pd.DataFrame(count_frequency,columns=['words', 'count'])
clean_tweets.head()

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))
clean_tweets[:25].sort_values(by='count').plot.barh(x='words',y='count',ax=ax,color='maroon')
ax.set_title("Common Words Found in Tweets (Including All Words)")
plt.show()

Categorize tweets as Positive, Negative or Neutral.
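
The rule used here can be sketched as a small pure-Python function: a compound score of at least 0.05 counts as positive, at most -0.05 as negative, and anything in between as neutral. The function name and the sample scores are made up. Note that this banded rule can give different shares than the strict `> 0` / `== 0` / `< 0` split used for the percentages earlier in the notebook.

```python
from collections import Counter


def label_from_compound(score, threshold=0.05):
    """Map a VADER compound score to Positive, Negative or Neutral."""
    if score >= threshold:
        return "Positive"
    if score <= -threshold:
        return "Negative"
    return "Neutral"


# Made-up compound scores (illustrative only)
scores = [0.6, 0.0, -0.4, 0.02, 0.3]
labels = [label_from_compound(score) for score in scores]

# Share of each label, in percent
counts = Counter(labels)
shares = {label: 100 * count / len(labels) for label, count in counts.items()}
print(shares)
```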

In [None]:
SAN = sent_df['sentiment_score'].apply(lambda x: "Positive" if x>=0.05 else ("Negative" if x<= -0.05 else "Neutral"))
new_review = sent_df['final_tweets']
new_review = new_review.tolist()
SAN = SAN.tolist()

# Avoid naming the variable `dict`, which would shadow the built-in type
pnn_data = {'final_tweets':new_review, 'sentiment_score':SAN}
pnn = pd.DataFrame(pnn_data)
pnn.head()

Finding the most common words used in positive and negative tweets

In [None]:
plt.figure(figsize=(5,5))
positive_text=' '.join(sent_df.loc[sent_df['sentiment_score'] > 0, 'final_tweets'])  # all positive tweets, not a single one
WC=WordCloud(width=1000,height=500,max_words=500,min_font_size=5)
positive_words=WC.generate(positive_text)
plt.imshow(positive_words,interpolation='bilinear')
plt.show()

In [None]:
plt.figure(figsize=(5,5))
negative_text=' '.join(sent_df.loc[sent_df['sentiment_score'] < 0, 'final_tweets'])  # all negative tweets, not a single one
WC=WordCloud(width=1000,height=500,max_words=500,min_font_size=5)
negative_words=WC.generate(negative_text)
plt.imshow(negative_words,interpolation='bilinear')
plt.show()

Converting the numerical data to categorical - Positive and Negative

In [None]:
sent_df['Sentiment'] = sent_df['sentiment_score'].apply(lambda x: "Positive" if x>=0 else "Negative")

sent_df.head()

Label Binarizer

In [None]:
lb = LabelBinarizer()
sent_df['label'] = lb.fit_transform(sent_df['Sentiment']).ravel()  # flatten the (n, 1) array to one value per row

sent_df.head()

Train, Test & Split
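
The notebook splits by position (the first 4900 rows for training, the rest for testing). If the rows happen to be ordered, for example by time, a shuffled split may be more representative. Below is a minimal standard-library sketch of that idea; the function `shuffled_split` and its defaults are assumptions for illustration. scikit-learn's `train_test_split`, already imported above, offers the same behaviour through its `test_size` and `random_state` parameters.

```python
import random


def shuffled_split(items, labels, test_fraction=0.3, seed=42):
    """Shuffle indices reproducibly, then split items and labels the same way."""
    if len(items) != len(labels):
        raise ValueError("items and labels must have the same length")
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split repeatable
    n_test = int(round(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, test_idx),
            pick(labels, train_idx), pick(labels, test_idx))


# Made-up data: ten placeholder tweets with alternating labels
tweets = [f"tweet {i}" for i in range(10)]
targets = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = shuffled_split(tweets, targets)
print(len(X_train), len(X_test))
```

Because features and labels are picked with the same index lists, each tweet keeps its own label after shuffling.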

In [None]:
ml_df = sent_df.copy()
ml_df.drop(['retweet_count', 'stem_tweets', 'lemm_tweets', 'Negative', 'Neutral', 'Positive'], axis=1, inplace=True)

ml_df.head()

In [None]:
train = ml_df.label[:4900]
test = ml_df.label[4900:]

ml_df.shape

Logistic Regression

In [None]:
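# Fit a separate model per feature set; refitting one estimator would overwrite the Bag of Words model
lr_bow = LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)
lr_bow.fit(cv_train_reviews,train) # Bag of Words features
print(lr_bow)

lr_tfidf = LogisticRegression(penalty='l2',max_iter=500,C=1,random_state=42)
lr_tfidf.fit(tv_train_reviews,train) # TF-IDF features
print(lr_tfidf)

In [None]:
lr_bow_predict = lr_bow.predict(cv_test_reviews)
print(lr_bow_predict)
lr_tfidf_predict = lr_tfidf.predict(tv_test_reviews)
print(lr_tfidf_predict)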

In [None]:
print(lr_bow_predict.shape)
print(lr_tfidf_predict.shape)

In [None]:
cv_test_reviews[0].toarray()

In [None]:
text = ml_df['final_tweets'][4900:].tolist()

In [None]:
# Avoid naming the variable `dict`, which would shadow the built-in type
results = {'text':text,'test':test, 'bow':lr_bow_predict, 'tfidf': lr_tfidf_predict}
df1 = pd.DataFrame(results)

In [None]:
df1.head()

In [None]:
df1[df1['test']!=df1['bow']]

Accuracy Scores

In [None]:
lr_bow_score = accuracy_score(test,lr_bow_predict)
print(lr_bow_score)
lr_tfidf_score = accuracy_score(test,lr_tfidf_predict)
print(lr_tfidf_score)

Classification Report

In [None]:
lr_bow_report=classification_report(test,lr_bow_predict,target_names=['Negative','Positive'])
print(lr_bow_report)
lr_tfidf_report = classification_report(test,lr_tfidf_predict,target_names=['Negative','Positive'])
print(lr_tfidf_report)

Confusion Matrix
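
For reference, the sketch below counts the four cells of a binary confusion matrix by hand and derives accuracy, precision and recall from them. The labels are made up and the helper `confusion_counts` is hypothetical. With `labels=[0,1]`, scikit-learn's `confusion_matrix` puts true classes on rows and predicted classes on columns, so its output reads `[[TN, FP], [FN, TP]]`.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tn, fp, fn, tp) for binary labels."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tn, fp, fn, tp


# Made-up labels: 1 = Positive, 0 = Negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print([[tn, fp], [fn, tp]])
print(accuracy, precision, recall)
```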

In [None]:
cm_bow = confusion_matrix(test,lr_bow_predict,labels=[0,1])
print(cm_bow)
cm_tfidf = confusion_matrix(test,lr_tfidf_predict,labels=[0,1])
print(cm_tfidf)

Support Vector Machine

Linear support vector machines for bag of words and tfidf features

In [None]:
# SVC is already imported above; fit one model per feature set so neither overwrites the other
svm_bow = SVC(kernel = 'linear', random_state = 0, C=1.0)
svm_bow.fit(cv_train_reviews,train) # Bag of Words features
print(svm_bow)
svm_tfidf = SVC(kernel = 'linear', random_state = 0, C=1.0)
svm_tfidf.fit(tv_train_reviews,train) # TF-IDF features
print(svm_tfidf)

Model Building and Evaluation

In [None]:
svm_bow_predict = svm_bow.predict(cv_test_reviews)
print(svm_bow_predict)
svm_tfidf_predict = svm_tfidf.predict(tv_test_reviews)
print(svm_tfidf_predict)

Accuracy Scores

In [None]:
svm_bow_score = accuracy_score(test,svm_bow_predict)
print(svm_bow_score)
svm_tfidf_score = accuracy_score(test,svm_tfidf_predict)
print(svm_tfidf_score)

Classification Report

In [None]:
# Label 0 is 'Negative' and label 1 is 'Positive', so the names must follow that order
svm_bow_report = classification_report(test,svm_bow_predict,target_names=['Negative','Positive'])
print(svm_bow_report)
svm_tfidf_report = classification_report(test,svm_tfidf_predict,target_names=['Negative','Positive'])
print(svm_tfidf_report)

Confusion Matrix

In [None]:
cm_bow = confusion_matrix(test,svm_bow_predict,labels=[0,1])
print(cm_bow)
cm_tfidf = confusion_matrix(test,svm_tfidf_predict,labels=[0,1])
print(cm_tfidf)

KNN

In [None]:
from sklearn.neighbors import KNeighborsClassifier
# One classifier per feature set, so the TF-IDF fit does not overwrite the Bag of Words fit
knn_bow = KNeighborsClassifier()
knn_tfidf = KNeighborsClassifier()

In [None]:
knn_bow.fit(cv_train_reviews,train)
print(knn_bow)
knn_tfidf.fit(tv_train_reviews,train)
print(knn_tfidf)

In [None]:
knn_bow_predict = knn_bow.predict(cv_test_reviews)
print(knn_bow_predict)
knn_tfidf_predict = knn_tfidf.predict(tv_test_reviews)
print(knn_tfidf_predict)

In [None]:
knn_bow_score = accuracy_score(test,knn_bow_predict)
print(knn_bow_score)
knn_tfidf_score = accuracy_score(test,knn_tfidf_predict)
print(knn_tfidf_score)

In [None]:
# Label 0 is 'Negative' and label 1 is 'Positive', so the names must follow that order
knn_bow_report = classification_report(test,knn_bow_predict,target_names=['Negative','Positive'])
print(knn_bow_report)
knn_tfidf_report = classification_report(test,knn_tfidf_predict,target_names=['Negative','Positive'])
print(knn_tfidf_report)

In [None]:
cm_bow = confusion_matrix(test,knn_bow_predict,labels=[0,1])
print(cm_bow)
cm_tfidf = confusion_matrix(test,knn_tfidf_predict,labels=[0,1])
print(cm_tfidf)

Decision Tree
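
The tree below is built with `criterion = 'entropy'`, which means each split is chosen to reduce the entropy of the labels. The standard-library sketch here shows that calculation on made-up labels; the helper names `entropy` and `information_gain` are chosen for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    n = len(parent)
    return (entropy(parent)
            - (len(left) / n) * entropy(left)
            - (len(right) / n) * entropy(right))


# Made-up labels: 1 = Positive, 0 = Negative
parent = [1, 1, 1, 0, 0, 0]
perfect_split = information_gain(parent, [1, 1, 1], [0, 0, 0])
useless_split = information_gain(parent, [1, 0], [1, 1, 0, 0])
print(entropy(parent), perfect_split, useless_split)
```

A split that separates the classes completely recovers the full entropy of the parent, while a split that leaves both children evenly mixed gains nothing.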

In [None]:
from sklearn.tree import DecisionTreeClassifier
# One tree per feature set, so the TF-IDF fit does not overwrite the Bag of Words fit
dtc_bow = DecisionTreeClassifier(criterion = 'entropy',max_depth=3)
dtc_tfidf = DecisionTreeClassifier(criterion = 'entropy',max_depth=3)

In [None]:
dtc_bow.fit(cv_train_reviews,train)
print(dtc_bow)
dtc_tfidf.fit(tv_train_reviews,train)
print(dtc_tfidf)

In [None]:
dtc_bow_predict = dtc_bow.predict(cv_test_reviews)
print(dtc_bow_predict)
dtc_tfidf_predict = dtc_tfidf.predict(tv_test_reviews)
print(dtc_tfidf_predict)

In [None]:
dtc_bow_score = accuracy_score(test,dtc_bow_predict)
print(dtc_bow_score)
dtc_tfidf_score = accuracy_score(test,dtc_tfidf_predict)
print(dtc_tfidf_score)

In [None]:
# Label 0 is 'Negative' and label 1 is 'Positive', so the names must follow that order
dtc_bow_report = classification_report(test,dtc_bow_predict,target_names=['Negative','Positive'])
print(dtc_bow_report)
dtc_tfidf_report = classification_report(test,dtc_tfidf_predict,target_names=['Negative','Positive'])
print(dtc_tfidf_report)

In [None]:
cm_bow = confusion_matrix(test,dtc_bow_predict,labels=[0,1])
print(cm_bow)
cm_tfidf = confusion_matrix(test,dtc_tfidf_predict,labels=[0,1])
print(cm_tfidf)