# End-to-End Data Science Project

# 1. Introduction

In this Project, I downloaded Black Panther 18 dataset from kaggle, performed a sentiment analysis on the data using the VaderSentiment library and then used it to train a sentiment detection model.

# Contents
1. Introduction
2. Data source
3. Data Quality Assessment and Cleaning
4. Sentiment Analysis
5. Data Spltting
6. Model Selection and Training
7. Model Evaluation
8. Conclusion

# 2. Data Source

I downloaded Black Panther 18 dataset from kaggle

In [192]:
#importing liberies
import pandas as pd
import numpy as np
import regex as re
import cleantext
import string
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from cleantext import clean
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import warnings 
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

In [193]:
#importing data frame
DataFrame=pd.read_csv("Black Panther.csv", encoding='latin1')

In [194]:
#selecting tweets in english only
df=DataFrame[DataFrame['Language']=='en']

# 3. Data Quality Assessment And Cleaning

In [195]:
#inspect length of the dataset
len(df.index)

64010

In [196]:
#Dropping dupliacated values
df=df.drop_duplicates()

In [197]:
#inspect length of the dataset after dropping duplicates
len(df.index)

49298

In [198]:
#Inspect information of the data
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49298 entries, 0 to 57116
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   Tweets     49298 non-null  object
 1   User_name  49298 non-null  object
 2   Language   49298 non-null  object
 3   Location   34997 non-null  object
 4   Time       49298 non-null  object
dtypes: object(5)
memory usage: 2.3+ MB


In [199]:
#Inspect tweets to identify data quality issues before cleaning
df['Tweets']

0        RT @CoachWilmore: #120: William OÕNeal and the...
1        RT @soprettyinlou: I hope my girl Shuri can br...
2        RT @PollsNig: Ok guys get in here, who do you ...
3        the thing is... black panther was so so good b...
4        RT @HillaryClinton: Saw Black Panther with Bil...
                               ...                        
57097                                    _ BLACK PANTHER _
57100    ___ my fav character was Shuri! https://t.co/j...
57103        ____ #wakanda forever https://t.co/Whxkh1TdTG
57112    _____&lt;ÑÑÑ we got the official Wakanda emoji...
57116                   _____ girl https://t.co/wDQr9Zrqnx
Name: Tweets, Length: 49298, dtype: object

In [200]:
#Cleaning tweeter data 
def text_process(tweet):
    #Converting tweets to lowercase 
    tweet=tweet.lower()
    #Removing emojis
    clean(tweet, no_emoji=True)
    #Removing URL's
    tweet=re.sub(r"http\S+|www\S+|https\S+",'',tweet,flags=re.MULTILINE)
    #Removing repeating characters
    tweet=re.sub(r'\@\w+|\#\w+|\d+\rt', '', tweet)
    #Removing rt
    tweet=re.sub('rt','',tweet)
    #Removing underscores
    tweet=re.sub('_','',tweet)
    tweet=re.sub('__','',tweet)
    tweet=re.sub('___','',tweet)
    tweet=re.sub('____','',tweet)
    #Removing stopwords
    tokens=nltk.word_tokenize(tweet)
    filted_words=[w for w in tokens if w not in stopwords.words('english')]
    #Removing punctuations
    nopunc=[w for w in filted_words if w not in string.punctuation]
    lemmatizer=WordNetLemmatizer()
    lemma_words=[lemmatizer.lemmatize(w) for w in nopunc]
    return " ".join(lemma_words)
#Applying text_process function to the data frame
df['Tweets']=df['Tweets'].apply(text_process)

In [201]:
#Check if data has been cleaned as expected
df

Unnamed: 0,Tweets,User_name,Language,Location,Time
0,william oõneal murder fred hampton william oõn...,SusieNattibree,en,,Sun Mar 04 10:28:35 +0000 2018
1,hope girl shuri bring back erik round 2 black ...,zinedine_7x,en,P,Sun Mar 04 10:28:35 +0000 2018
2,ok guy get think would win battle okoye black ...,edxxtrock,en,"Toulouse, France",Sun Mar 04 10:28:35 +0000 2018
3,thing ... black panther good session late woke...,zekejaegers,en,,Sun Mar 04 10:28:36 +0000 2018
4,saw black panther bill afternoon amp loved bea...,quirion77,en,"Loire-Atlantique, Pays de la Loire",Sun Mar 04 10:28:36 +0000 2018
...,...,...,...,...,...
57097,black panther,icechology,en,city of the goddesses,Sun Mar 04 03:40:32 +0000 2018
57100,fav character shuri,mc_magic1887,en,_,Sun Mar 04 03:26:18 +0000 2018
57103,forever,jamaa222,en,"Mombasa, Kenya",Sun Mar 04 04:09:06 +0000 2018
57112,lt ñññ got official wakanda emoji right,var_aleti,en,not in school,Sun Mar 04 03:20:06 +0000 2018


# 4. Sentiment Analysis

In [202]:
analyzer=SentimentIntensityAnalyzer()
df['scores']=df['Tweets'].apply(lambda text: analyzer.polarity_scores(text) )

In [203]:
#identify the polarity
def sentimentpredict(sentiment):
    if sentiment['compound']>=0.05:
        return "Positive"
    elif sentiment['compound']<=-0.05: 
        return "Negative"
    else:
        return "Neutral"
df['label']=df['scores'].apply(lambda x: sentimentpredict(x))

In [204]:
df.drop(['User_name','Language','Time','scores','Location'],axis=1,inplace=True)

In [205]:
df

Unnamed: 0,Tweets,label
0,william oõneal murder fred hampton william oõn...,Negative
1,hope girl shuri bring back erik round 2 black ...,Positive
2,ok guy get think would win battle okoye black ...,Positive
3,thing ... black panther good session late woke...,Positive
4,saw black panther bill afternoon amp loved bea...,Positive
...,...,...
57097,black panther,Neutral
57100,fav character shuri,Positive
57103,forever,Neutral
57112,lt ñññ got official wakanda emoji right,Neutral


# 5. Data Splitting

In [206]:
from sklearn.model_selection import train_test_split

In [207]:
text_train, text_test, label_train, label_test = train_test_split(df['Tweets'],df['label'], test_size=0.33, random_state=42)

# 6.Model Selection and Training

In [208]:
from sklearn.feature_extraction.text import CountVectorizer

In [209]:
from sklearn.feature_extraction.text import TfidfTransformer

In [210]:
from sklearn.naive_bayes import MultinomialNB

In [211]:
from sklearn.pipeline import Pipeline 

In [212]:
pipeline=Pipeline([
    ('bow',CountVectorizer(analyzer=text_process)),
    ('tfidf',TfidfTransformer()),
    ('classifier',MultinomialNB())
])

In [213]:
pipeline.fit(text_train, label_train)

In [214]:
predictions=pipeline.predict(text_test)

# 7. Model Evaluation

In [215]:
from sklearn.metrics import classification_report

In [216]:
print(classification_report(label_test,predictions))

              precision    recall  f1-score   support

    Negative       0.89      0.57      0.69      4359
     Neutral       0.67      0.25      0.37      5114
    Positive       0.53      0.91      0.67      6796

    accuracy                           0.61     16269
   macro avg       0.70      0.58      0.58     16269
weighted avg       0.67      0.61      0.58     16269



# 8. Conclusion

The model is pretty good with 61% accuracy