![title](header1.png "Header")
<h3 align="center">Train a classifier on a 1.6-million-tweet dataset to help customers see how a product is evaluated on Twitter</h3> 
___

# Content
### 1. [Data set insights](#01)
* Data set
* [Data cleaning (round 1)](#1.2)
* [TextBlob for data evaluation](#1.3)
* [WordCloud for tweet word frequency](#1.4)

### 2. Model training
* Word embedding and vectorizing
* RNN modeling
* ML modeling: TF-IDF & Random Forest
* Modeling evaluations and saving

### 3. Model improvement
* Encoding new target
* Cleaning

### 4. Model applying
* [Tweet retrieval simulation using tweepy](#4.1)
* [Tweet text processing](#4.2)
* Sentiment analysis using the model

### 5. Web deployment

# Tweet retrieval using Tweepy

#### Aim
Simulate a request from a user searching for perceptions of a product on Twitter.

#### Description
* set up variables for the Twitter API
* use api.search to obtain tweets with specific queries and filters
* convert the tweets' contents to a data frame
* preprocess the content to extract information and reduce noise
* save as a CSV file

In [1]:
# Variables for Twitter API
# Credentials are read from environment variables instead of being hardcoded
# (the variable names below are an assumption; set them before running)
import os
ACCESS_TOKEN = os.environ['TWITTER_ACCESS_TOKEN']
ACCESS_TOKEN_SECRET = os.environ['TWITTER_ACCESS_TOKEN_SECRET']
CONSUMER_API_KEY = os.environ['TWITTER_CONSUMER_API_KEY']
CONSUMER_API_SECRET = os.environ['TWITTER_CONSUMER_API_SECRET']


In [2]:
# Import required libraries
import tweepy as tw
import pandas as pd
import numpy as np
pd.set_option('max_colwidth',150)

### Get Tweets

In [3]:
# Create an authentication object
auth = tw.OAuthHandler(CONSUMER_API_KEY, CONSUMER_API_SECRET)

# Setting your access token and secret
auth.set_access_token(ACCESS_TOKEN, ACCESS_TOKEN_SECRET)

# Create the API object passing the auth object
api = tw.API(auth)

### Using cursor object

In [26]:
# Search for up to 100 recent English tweets matching the query, paged through a Cursor object (10 per request)
# The filters exclude retweets and keep only replies
# Note: Tweepy 4.x renames api.search to api.search_tweets
# search_words='LambtonCollege OR (Cestar College) OR (Lambton College)'
search_words = 'toronto university'
tweet_search = search_words + ' -filter:retweets AND filter:replies'
tweets = tw.Cursor(api.search, q=tweet_search, count=10, lang='en').items(100)
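
The query string above is the user's search terms followed by search filters. A minimal sketch of how that string could be assembled for any product name is shown here; `build_search_query` and its parameters are hypothetical names, not part of Tweepy or of this notebook.

```python
def build_search_query(terms, exclude_retweets=True, replies_only=True):
    """Append search filters to the user's terms (hypothetical helper)."""
    filters = []
    if exclude_retweets:
        filters.append("-filter:retweets")
    if replies_only:
        filters.append("filter:replies")
    if not filters:
        return terms
    # Same layout as the notebook: terms, a space, then filters joined by AND
    return terms + " " + " AND ".join(filters)


print(build_search_query("toronto university"))
# toronto university -filter:retweets AND filter:replies
```

Keeping the filters as flags makes it easy to compare results with and without replies for the same product.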

### Iterate the retrieved tweet records and save to a data frame

In [27]:
# Store selected fields of each tweet in a list
# Note: the text is stored as UTF-8 bytes, so it appears as b'...' in the data frame below
tweet_component = []
for tweet in tweets:
    print(f'{tweet.user.name} said: {tweet.text}')
    tweet_component.append([tweet.user.id, tweet.created_at,
                            tweet.user.screen_name, tweet.user.location, tweet.text.encode('utf-8')])


asmi said: @Sonnipun ight 
1. university of toronto
2. university of british columbia
3. university college london
4. universi… https://t.co/K6jBimYJi6
Brian Dixon said: @verslibre Cust: "I am currently a PhD student in the philosophy department at York University in Toronto.
My areas… https://t.co/O4v8HoJOfi
Hannah - aka crip cinnamon lid ♿️ said: Citation: Maureen K. Lux, _Separate Beds: A History of Indian Hospitals in Canada, 1920s-1980s_, (Toronto: Universi… https://t.co/lJuxJ4goVC
🏙 🍁 🇨🇦 said: @krismeloche @CarymaRules @fordnation Then let Laurentian U collapse.
Cancelled a number of planned campuses.
Cance… https://t.co/OfeNXo50jr
Derek Estabrook said: @MegMcMorris On the + side, he’s a young, bright engineer and a process improvement expert. He started a new role l… https://t.co/RlkZY1qksQ
Peoples Party of Canada North Okanagan Shuswap said: @SpencerFernando We need to expose the University of Toronto @wef paid off shills in their epidemiology department.… https://t.co/lne4reIR

Fulford Academy said: 📌University of Toronto (Scarborough)
📌University of Toronto (St. George)
📌University of Winnipeg
📌Waterloo Universi… https://t.co/IlkULUu9Pu
Fulford Academy said: 📌Masaryk Medical University (Czech Republic)
📌McMaster University
📌OCAD
📌Ottawa University
📌Queen's University
📌She… https://t.co/1sAtVo74qt
Christopher J. Rutty said: @uoftmedicine @SHeximer @canadapostcorp @BantingHouse The vial of "Insulin Toronto" featured on the new stamp was p… https://t.co/imbADqLglO
Gary Wagman said: @EricaBrecher My late Uncle Dr Murray Wagman and his family lived in Metuchen New Jersey from 1946 to his death in… https://t.co/zMjAdyIb3c
Social Planning Council of Ottawa said: _

Rachel Bromberg is the Co-Founder of the @reachout_to, the Executive Director of the International Mobile Servic… https://t.co/tuFSe3xDrQ
P Rowson said: @thatginamiller The example set by Banting, Collip &amp; Best is exemplary, “On Jan. 23, 1923, Banting, Collip and Best… https://t.co/JjjJa9Urzb
Univers
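
The loop above can be tried without API access by using stand-in objects that expose the same attribute names the notebook reads (`user.id`, `created_at`, `user.screen_name`, `user.location`, `text`). The sketch below is an illustration under that assumption: `tweet_to_row` is a hypothetical helper, the sample tweet is made up, and the text is kept as `str` rather than encoded to bytes.

```python
from datetime import datetime
from types import SimpleNamespace


def tweet_to_row(tweet):
    """Flatten one tweet-like object into the five fields the notebook stores."""
    return [tweet.user.id, tweet.created_at, tweet.user.screen_name,
            tweet.user.location, tweet.text]


# Stand-in for a Tweepy status object (hypothetical data, not a real tweet)
fake_tweet = SimpleNamespace(
    user=SimpleNamespace(id=1, name="Example User",
                         screen_name="example_user", location=""),
    created_at=datetime(2021, 4, 17, 7, 2, 40),
    text="Visiting a university in Toronto\nnext week",
)

rows = [tweet_to_row(t) for t in [fake_tweet]]
print(rows[0][2], rows[0][4].splitlines())
```

Because the text stays a `str`, the newline remains a real line break instead of the escaped `\n` seen in a bytes representation.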

In [28]:
# Convert to a DataFrame holding each tweet's user ID, timestamp, screen name, location and text
df_tweet = pd.DataFrame(tweet_component, columns=['ID', 'Time', 'User name', 'Location', 'Text'])
df_tweet.head(9)

Unnamed: 0,ID,Time,User name,Location,Text
0,1040519558442868738,2021-04-17 07:02:40,thatasmi,"Indore, India",b'@Sonnipun ight \n1. university of toronto\n2. university of british columbia\n3. university college london\n4. universi\xe2\x80\xa6 https://t.co...
1,25270260,2021-04-17 03:57:51,BrianBoruNZ,"Dunedin, New Zealand","b'@verslibre Cust: ""I am currently a PhD student in the philosophy department at York University in Toronto.\nMy areas\xe2\x80\xa6 https://t.co/O4..."
2,2407031720,2021-04-17 03:13:45,HannahntheWolf,Occupied Musqueam Territory,"b'Citation: Maureen K. Lux, _Separate Beds: A History of Indian Hospitals in Canada, 1920s-1980s_, (Toronto: Universi\xe2\x80\xa6 https://t.co/lJu..."
3,991132349663412224,2021-04-17 02:56:13,theKeenUrbanist,"Toronto, Ontario, Canada",b'@krismeloche @CarymaRules @fordnation Then let Laurentian U collapse.\nCancelled a number of planned campuses.\nCance\xe2\x80\xa6 https://t.co/O...
4,402370541,2021-04-17 02:27:52,estabde,Halifax,"b'@MegMcMorris On the + side, he\xe2\x80\x99s a young, bright engineer and a process improvement expert. He started a new role l\xe2\x80\xa6 https..."
5,1080573064503410688,2021-04-17 00:42:01,ppcnos,"Vernon, British Columbia",b'@SpencerFernando We need to expose the University of Toronto @wef paid off shills in their epidemiology department.\xe2\x80\xa6 https://t.co/lne...
6,264385227,2021-04-16 23:55:14,TravPederson,Winnipeg,b'@DrKaliBarrett is a Critical Care Physician with the University Health Network in Toronto. She is a member of the S\xe2\x80\xa6 https://t.co/do3...
7,386007819,2021-04-16 22:30:35,MLGG2,Canada,b'@SabiVM @UofT_dlsph @AmitAryaMD @NaheedD @DFisman @bernardcampagna @picardonhealth @Andre_Lariviere @WHO\xe2\x80\xa6 https://t.co/gov7gyRsBa'
8,386007819,2021-04-16 21:41:50,MLGG2,Canada,"b'@RobynUrback Ford\xe2\x80\x99s advisers are not physicians not epidemiologists, not very strong academically. Aldasteinn Brown\xe2\x80\xa6 https..."


In [29]:
df_tweet.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   ID         100 non-null    int64         
 1   Time       100 non-null    datetime64[ns]
 2   User name  100 non-null    object        
 3   Location   100 non-null    object        
 4   Text       100 non-null    object        
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 4.0+ KB


### Tweet cleaning

In [30]:
import nltk
from nltk.corpus import stopwords
import re
import string

In [31]:
# Use the same text preprocessing procedure as in the tweet sentiment analysis
def data_cleaner(tweet):
    # lowercase the text
    tweet = tweet.lower()
    
    # remove urls (str() turns a bytes value into its b'...' representation)
    tweet = re.sub(r'http\S+', ' ', str(tweet))
    
    # remove html tags
    tweet = re.sub(r'<.*?>', ' ', tweet)
    
    # remove words containing digits
    tweet = re.sub(r'\w*\d\w*', ' ', tweet)
    
    # remove hashtags
    tweet = re.sub(r'#\w+', ' ', tweet)
    
    # remove mentions
    tweet = re.sub(r'@\w+', ' ', tweet)
    
    # replace runs of non-word characters (anything but letters, digits and underscores) with a space
    tweet = re.sub(r'\W+', ' ', tweet)
    
    tweet = tweet.strip()
    
    # remove words shorter than 2 characters and stop words
    words = [w for w in tweet.split() if len(w) > 1]
    tweet = " ".join(word for word in words if word not in stop_words)
    
    return tweet


stop_words = stopwords.words('english')

df_tweet['cleaned_tweet'] = df_tweet['Text'].apply(data_cleaner)

# Replace empty strings with NaN (in every column, so empty locations also become NaN),
# then drop rows whose cleaned_tweet is empty
df_tweet.replace('', np.nan, inplace=True)
df_tweet = df_tweet.dropna(subset=['cleaned_tweet'])

df_tweet = df_tweet[['User name', 'Time', 'Location', 'cleaned_tweet']]
df_tweet.head(3)


Unnamed: 0,User name,Time,Location,cleaned_tweet
0,thatasmi,2021-04-17 07:02:40,"Indore, India",ight university toronto university british columbia university college london universi
1,BrianBoruNZ,2021-04-17 03:57:51,"Dunedin, New Zealand",cust currently phd student philosophy department york university toronto nmy areas
2,HannahntheWolf,2021-04-17 03:13:45,Occupied Musqueam Territory,citation maureen lux _separate beds history indian hospitals canada toronto universi
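
Tokens such as `nmy` in the cleaned output most likely arise because the text was stored as bytes: `str()` on a bytes value yields its `b'...'` representation, in which a newline is the two characters `\` and `n`, and the cleaner then strips only the backslash. The sketch below shows one way to avoid this by decoding bytes first. It is not the notebook's cleaner: `clean_text` is a hypothetical helper, the stop-word set is a tiny stand-in for NLTK's English list, the sample tweet is made up, and the final regex keeps ASCII letters only, which is stricter than `\W+`.

```python
import re

# Tiny stand-in stop-word set for illustration; the notebook uses NLTK's English list
STOP_WORDS = {"i", "am", "a", "in", "the", "at", "my", "of"}


def clean_text(tweet):
    """Clean a tweet given as str or UTF-8 bytes (hypothetical helper)."""
    if isinstance(tweet, bytes):
        tweet = tweet.decode("utf-8")        # keeps real newlines
    tweet = tweet.lower()
    tweet = re.sub(r"http\S+", " ", tweet)   # URLs
    tweet = re.sub(r"[@#]\w+", " ", tweet)   # mentions and hashtags
    tweet = re.sub(r"\w*\d\w*", " ", tweet)  # words containing digits
    tweet = re.sub(r"[^a-z\s]", " ", tweet)  # anything that is not an ASCII letter
    words = [w for w in tweet.split() if len(w) > 1 and w not in STOP_WORDS]
    return " ".join(words)


# Made-up sample modelled on the tweets above
raw = ('@someone Cust: "I am a PhD student at York University in Toronto.\n'
       'My areas… https://example.com/abc').encode("utf-8")

print(clean_text(raw))
# cust phd student york university toronto areas
```

With the bytes decoded, `My` after the line break is an ordinary stop word and is removed, rather than surviving as `nmy`.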


In [15]:
# Save to a csv file
df_tweet.to_csv('retrieved_tweet.csv', index=False, encoding='utf-8')

In [32]:
df_tweet.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 97 entries, 0 to 99
Data columns (total 4 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   User name      97 non-null     object        
 1   Time           97 non-null     datetime64[ns]
 2   Location       73 non-null     object        
 3   cleaned_tweet  97 non-null     object        
dtypes: datetime64[ns](1), object(3)
memory usage: 3.8+ KB


## Model application


In [16]:
import joblib
import numpy as np

### Load the TF-IDF and Random Forest model

In [17]:
# Load the saved pipeline so it can be applied to the retrieved tweets
model = joblib.load('tfidf_rf_pipeline.sav')

In [18]:
df_tweet = pd.read_csv('retrieved_tweet.csv')
df_tweet.cleaned_tweet


0                                                toyota lexus good size even convertible
1                                   saying toyota makes lexus hyundai makes genesis true
2                       japanese government crafted economy decades around kind generati
3                            upgraded nissan cefiro nissan maxima tv series toyota lexus
4                                                  meet peter worked habanero consulting
                                             ...                                        
93                two toyota house hold luxury brand extremely disappointing stop giving
94                      say cars money maybe talk saloon go understand specifically said
95    dey craze like toyota cars dey drive lexus tundra cause na mistubushi dey make dem
96                      hello thank message yes course could interested brand company se
97                          circa toyota lexus beat reliability sats saved acquiring one
Name: cleaned_tweet, 

In [21]:
df_tweet['labels'] = df_tweet['cleaned_tweet'].apply(lambda x: model.predict([x])[0])

In [22]:
df_tweet

Unnamed: 0,User name,Time,Location,cleaned_tweet,labels
0,kimberlydark,2021-04-17 06:28:47,at large,toyota lexus good size even convertible,1
1,jortle,2021-04-17 03:29:18,"Houston, TX",saying toyota makes lexus hyundai makes genesis true,1
2,ArthurWilliam52,2021-04-17 02:52:43,East Mids,japanese government crafted economy decades around kind generati,1
3,murdock_tm,2021-04-17 02:38:53,USA,upgraded nissan cefiro nissan maxima tv series toyota lexus,0
4,camcavers,2021-04-17 01:12:52,"Vancouver, BC",meet peter worked habanero consulting,0
...,...,...,...,...,...
93,abigailm1971,2021-04-14 11:56:49,"Orlando, FL",two toyota house hold luxury brand extremely disappointing stop giving,0
94,TheycallmeAGU_,2021-04-14 11:47:14,,say cars money maybe talk saloon go understand specifically said,0
95,xomtochukwu,2021-04-14 11:37:12,"Lagos de Moreno, Jalisco",dey craze like toyota cars dey drive lexus tundra cause na mistubushi dey make dem,0
96,anthoriv,2021-04-14 09:36:20,La Réunion,hello thank message yes course could interested brand company se,1


In [23]:
def sentiment(df):
    # Share of tweets predicted positive (label 1); also works when no tweet is labelled 1
    positive_share = (df['labels'] == 1).mean()
    if positive_share >= 0.8:
        return 'Highly recommend!'
    elif positive_share >= 0.6:
        return 'Recommend!'
    elif positive_share >= 0.4:
        return 'Normal Quality.'
    else:
        return 'Not recommend!'

a = sentiment(df_tweet)
print(a)


Not recommend!
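
To make the thresholds concrete, the following sketch applies the same cut-offs (0.8, 0.6 and 0.4 on the share of tweets labelled 1) to plain Python lists. `recommendation` is a hypothetical stand-in for the notebook's function, and the label lists are made-up inputs rather than model output.

```python
def recommendation(labels):
    """Map a list of 0/1 predictions to the notebook's four verdicts."""
    if not labels:
        raise ValueError("no predictions to summarise")
    share = sum(1 for label in labels if label == 1) / len(labels)
    if share >= 0.8:
        return "Highly recommend!"
    elif share >= 0.6:
        return "Recommend!"
    elif share >= 0.4:
        return "Normal Quality."
    return "Not recommend!"


print(recommendation([1] * 8 + [0] * 2))  # 80% positive
print(recommendation([1] * 3 + [0] * 7))  # 30% positive
```

Each branch only needs a lower bound, because the earlier branches have already handled the higher shares.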


In [33]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Initialize the VADER sentiment intensity analyzer
# (requires the 'vader_lexicon' NLTK resource, e.g. via nltk.download('vader_lexicon'))
sia = SentimentIntensityAnalyzer()
# Compute the VADER compound score for each cleaned tweet
df_tweet['vader'] = df_tweet['cleaned_tweet'].apply(lambda x: sia.polarity_scores(x)['compound'])
df_tweet.head()

Unnamed: 0,User name,Time,Location,cleaned_tweet,vader
0,thatasmi,2021-04-17 07:02:40,"Indore, India",ight university toronto university british columbia university college london universi,0.0
1,BrianBoruNZ,2021-04-17 03:57:51,"Dunedin, New Zealand",cust currently phd student philosophy department york university toronto nmy areas,0.0
2,HannahntheWolf,2021-04-17 03:13:45,Occupied Musqueam Territory,citation maureen lux _separate beds history indian hospitals canada toronto universi,0.0
3,theKeenUrbanist,2021-04-17 02:56:13,"Toronto, Ontario, Canada",let laurentian collapse ncancelled number planned campuses ncance,-0.4404
4,estabde,2021-04-17 02:27:52,Halifax,side young bright engineer process improvement expert started new role,0.7096


In [34]:
np.mean(df_tweet['vader'].values)

0.046994845360824745
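
One way to read compound scores is through the cut-offs commonly used with VADER: 0.05 and above counts as positive, -0.05 and below as negative, and anything in between as neutral. The sketch below applies those cut-offs to the five scores shown in the table above and to the mean. `vader_band` is a hypothetical helper, and applying a per-text cut-off to an average is an assumption made for illustration.

```python
def vader_band(compound, threshold=0.05):
    """Label a VADER compound score as positive, neutral or negative."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# First five compound scores from the table above
scores = [0.0, 0.0, 0.0, -0.4404, 0.7096]
bands = [vader_band(s) for s in scores]
print(bands)
# ['neutral', 'neutral', 'neutral', 'negative', 'positive']

# The mean over all retrieved tweets reported by the notebook
print(vader_band(0.046994845360824745))
# neutral
```

Under these cut-offs the average falls just inside the neutral band, which fits a sample where most tweets score exactly 0.0.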

In [35]:
df_tweet

Unnamed: 0,User name,Time,Location,cleaned_tweet,vader
0,thatasmi,2021-04-17 07:02:40,"Indore, India",ight university toronto university british columbia university college london universi,0.0000
1,BrianBoruNZ,2021-04-17 03:57:51,"Dunedin, New Zealand",cust currently phd student philosophy department york university toronto nmy areas,0.0000
2,HannahntheWolf,2021-04-17 03:13:45,Occupied Musqueam Territory,citation maureen lux _separate beds history indian hospitals canada toronto universi,0.0000
3,theKeenUrbanist,2021-04-17 02:56:13,"Toronto, Ontario, Canada",let laurentian collapse ncancelled number planned campuses ncance,-0.4404
4,estabde,2021-04-17 02:27:52,Halifax,side young bright engineer process improvement expert started new role,0.7096
...,...,...,...,...,...
95,crollyson,2021-04-14 19:47:17,,university toronto graduate count,0.0000
96,junglejava1,2021-04-14 19:41:48,Canada,according toronto star least study data eventually found way economist texas amp university,0.0000
97,Fergie_Kate,2021-04-14 19:39:51,"Glasgow, Glasgow City G31",university toronto love canada scottish,0.6369
98,FemOrgChem,2021-04-14 19:18:09,,dr dong began academic career university toronto worked heterocycles medicinal chemist,0.0000
