## Day 85 Lecture 1 Assignment

In this assignment, we will learn how to combine additional layer types (an embedding layer and an LSTM layer) with dense layers in a two-branch model to improve model performance.

In [201]:
import numpy as np
import pandas as pd

We will explore a dataset containing information about Twitter users and detect whether or not each user is a bot.

In [202]:
twitter = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/training_data_2_csv_UTF.csv')

In [203]:
twitter.head()

Unnamed: 0,id,id_str,screen_name,location,description,url,followers_count,friends_count,listed_count,created_at,favourites_count,verified,statuses_count,lang,status,default_profile,default_profile_image,has_extended_profile,name,bot
0,8.16e+17,"""815745789754417152""","""HoustonPokeMap""","""Houston, TX""","""Rare and strong PokŽmon in Houston, TX. See m...","""https://t.co/dnWuDbFRkt""",1291,0,10,"""Mon Jan 02 02:25:26 +0000 2017""",0,False,78554,"""en""","{\r ""created_at"": ""Sun Mar 12 15:44:04 +0...",True,False,False,"""Houston PokŽ Alert""",1
1,4843621000.0,4843621225,kernyeahx,"Templeville town, MD, USA",From late 2014 Socium Marketplace will make sh...,,1,349,0,2/1/2016 7:37,38,False,31,en,,True,False,False,Keri Nelson,1
2,4303727000.0,4303727112,mattlieberisbot,,"Inspired by the smart, funny folks at @replyal...",https://t.co/P1e1o0m4KC,1086,0,14,Fri Nov 20 18:53:22 +0000 2015,0,False,713,en,"{'retweeted': False, 'is_quote_status': False,...",True,False,False,Matt Lieber Is Bot,1
3,3063139000.0,3063139353,sc_papers,,,,33,0,8,2/25/2015 20:11,0,False,676,en,Construction of human anti-tetanus single-chai...,True,True,False,single cell papers,1
4,2955142000.0,2955142070,lucarivera16,"Dublin, United States",Inspiring cooks everywhere since 1956.,,11,745,0,1/1/2015 17:44,146,False,185,en,,False,False,False,lucarivera16,1


In [204]:
twitter.shape

(2797, 20)

Start by getting rid of all columns that are not useful.

In [205]:
# Answer below:
twitter = twitter.drop(['id', 'id_str', 'screen_name', 'url', 'status', 'name', 'created_at', 'location'], axis=1)
twitter

Unnamed: 0,description,followers_count,friends_count,listed_count,favourites_count,verified,statuses_count,lang,default_profile,default_profile_image,has_extended_profile,bot
0,"""Rare and strong PokŽmon in Houston, TX. See m...",1291,0,10,0,False,78554,"""en""",True,False,False,1
1,From late 2014 Socium Marketplace will make sh...,1,349,0,38,False,31,en,True,False,False,1
2,"Inspired by the smart, funny folks at @replyal...",1086,0,14,0,False,713,en,True,False,False,1
3,,33,0,8,0,False,676,en,True,True,False,1
4,Inspiring cooks everywhere since 1956.,11,745,0,146,False,185,en,False,False,False,1
...,...,...,...,...,...,...,...,...,...,...,...,...
2792,"Twitter CMO. Favorite title: Mama. Never, ever...",18998,2005,425,2503,False,3498,en,False,False,True,0
2793,"I live in brooklyn, I'm a bike messenger, I pl...",32,54,0,1,False,97,en,True,False,False,0
2794,astrophysicist,45044433,7451,68157,24,True,9606,en,False,False,False,0
2795,"I'm quite out of my mind, actually, but people...",16,64,1,15,False,62,en,False,False,True,0


Next, get rid of all columns that contain more than 30% missing data. After that, remove all rows containing at least one missing observation.

In [206]:
# Answer below:
twitter.isnull().sum() / twitter.shape[0]


description              0.144083
followers_count          0.000000
friends_count            0.000000
listed_count             0.000000
favourites_count         0.000000
verified                 0.000000
statuses_count           0.000000
lang                     0.000000
default_profile          0.000000
default_profile_image    0.000000
has_extended_profile     0.035395
bot                      0.000000
dtype: float64
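
The cell above only reports the fraction of missing values per column. A minimal sketch of the two removal steps follows, shown on a small made-up frame; the `demo` frame and the `threshold` name are illustrative assumptions, not part of the assignment data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `twitter`.
demo = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, 1.0, np.nan],  # 75% missing
    "some_missing": [1.0, np.nan, 3.0, 4.0],          # 25% missing
    "complete": [1, 2, 3, 4],
})

threshold = 0.30  # drop columns with more than 30% missing values

# Fraction of missing values per column (equivalent to isnull().sum() / n_rows).
missing_fraction = demo.isnull().mean()

# Keep only the columns at or below the threshold.
kept = demo.loc[:, missing_fraction <= threshold]

# Then drop every row that still has at least one missing value.
cleaned = kept.dropna(axis=0, how="any")
```

With the fractions printed above, no column exceeds 30%, so on `twitter` only the row-dropping step would change anything.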

Now we will reuse the text-normalization functions from a previous assignment.

In [207]:
from nltk.corpus import stopwords
import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def remove_stopwords(input_text):
    stopwords_list = stopwords.words('english')
    # Some words which might indicate a certain sentiment are kept via a whitelist
    whitelist = ["n't", "not", "no"]
    words = input_text.split()
    clean_words = [word for word in words if (word not in stopwords_list or word in whitelist) and len(word) > 1]
    return " ".join(clean_words)

def stem_list(word_list):
    stemmed = []
    for word in word_list:
        stemmedword = stemmer.stem(word)
        stemmed.append(stemmedword)
    return stemmed

def normalize(terms):
    terms = str(terms)
    terms = terms.lower()
    terms = remove_stopwords(terms)
    word_delimiters = u'[\\[\\]\n.!?,;:\t\\-\\"\\(\\)\\\'\u2019\u2013 ]'
    term_list = re.split(word_delimiters, terms)
    trimmed = [x.rstrip() for x in term_list]
    stemmed = stem_list(trimmed)
    space = ' '
    normed = space.join(stemmed)
    normed = normed.replace('  ', ' ')
    return normed
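
To see what the split, join, and replace steps in `normalize` do with runs of punctuation, here is a small sketch that uses only `re`. It copies the delimiter pattern from above and leaves out stopword removal and stemming; `split_and_join` and `split_and_join_compact` are hypothetical helper names introduced for illustration.

```python
import re

# Same delimiter pattern as in `normalize` above.
word_delimiters = u'[\\[\\]\n.!?,;:\t\\-\\"\\(\\)\\\'\u2019\u2013 ]'

def split_and_join(text):
    """Mirror the split/join/replace steps of `normalize` (no stopwords, no stemming)."""
    tokens = re.split(word_delimiters, text.lower())
    joined = ' '.join(tokens)
    # A single replace halves each run of spaces; it does not collapse it fully.
    return joined.replace('  ', ' ')

def split_and_join_compact(text):
    """Hypothetical variant: drop empty tokens so no extra spaces appear."""
    tokens = re.split(word_delimiters, text.lower())
    return ' '.join(token for token in tokens if token)

sample = "Bots... everywhere!"
original_style = split_and_join(sample)
compact_style = split_and_join_compact(sample)
```

Consecutive delimiters produce empty tokens, so the first helper can leave double or trailing spaces in its result, while the second returns single-spaced text. Whether that difference matters depends on the tokenizer applied afterwards.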

We will create two branches: one will process the text data in the description column, and the other will process all other columns. First, create a numpy array with the encoded data from the description column: normalize each description, one-hot encode the text, pad each row to a common length, and collect the result in a numpy array.

In [208]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [209]:
# Answer below:
twitter.description = twitter.description.apply(normalize)

In [210]:
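from tensorflow.keras.preprocessing.text import one_hot
# Note: summing the column concatenates all descriptions into one string,
# so this counts distinct characters rather than distinct words.
vocab_size = len(set(twitter.description.sum()))
desc = twitter.description.apply(one_hot, args=[vocab_size])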
desc

0          [84, 90, 67, 53, 22, 54, 67, 18, 76, 105, 104]
1       [81, 104, 14, 81, 43, 20, 112, 10, 111, 23, 15...
2       [11, 108, 58, 1, 33, 21, 93, 96, 52, 66, 43, 9...
3                                                    [48]
4                                    [11, 61, 85, 29, 80]
                              ...                        
2792             [103, 85, 86, 62, 3, 91, 21, 42, 78, 99]
2793    [1, 22, 95, 116, 99, 62, 32, 68, 95, 116, 36, ...
2794                                                 [16]
2795    [95, 116, 54, 68, 57, 110, 4, 92, 49, 76, 76, ...
2796                   [110, 110, 12, 5, 10, 109, 67, 62]
Name: description, Length: 2797, dtype: object
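
The indices above all stay small, which is consistent with `vocab_size` being a count of distinct characters: `set()` applied to one long string yields its characters. A minimal sketch of the difference between that count and a word-level count follows, using a made-up list of descriptions (`descriptions` is a hypothetical stand-in for `twitter.description`).

```python
# Hypothetical normalized descriptions standing in for `twitter.description`.
descriptions = ["rare strong bot", "strong bot alert", "astrophysicist"]

# Summing a Series of strings concatenates them; joining with "" mimics that.
concatenated = "".join(descriptions)

# A set built from one long string contains its distinct characters.
char_vocab_size = len(set(concatenated))

# Splitting each description into words first gives the distinct-word count.
word_vocab = {word for text in descriptions for word in text.split()}
word_vocab_size = len(word_vocab)
```

Passing a word-level size to `one_hot` (possibly with some headroom, since it hashes words and distinct words can share an index) would likely reduce collisions between unrelated words.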

In [211]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

desc = pad_sequences(desc)
desc.shape

(2797, 66)
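
By default `pad_sequences` pads with zeros at the front of each sequence up to the length of the longest one, which is why every row here has 66 entries. The following numpy sketch reproduces that default behaviour for illustration only; `pad_left` is a hypothetical helper, not the Keras implementation.

```python
import numpy as np

def pad_left(sequences, value=0):
    """Left-pad integer lists with `value` to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    padded = np.full((len(sequences), max_len), value, dtype=np.int32)
    for row, seq in enumerate(sequences):
        if seq:  # an empty sequence stays all padding
            padded[row, max_len - len(seq):] = seq
    return padded

encoded = [[84, 90, 67], [48], [11, 61, 85, 29, 80]]
padded = pad_left(encoded)
```

Index 0 is reserved for padding in this scheme, which is why the embedding layer later needs `np.max(desc) + 1` rows.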

Convert all boolean variables to numeric (0 for false and 1 for true).

In [212]:
twitter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2797 entries, 0 to 2796
Data columns (total 12 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   description            2797 non-null   object
 1   followers_count        2797 non-null   int64 
 2   friends_count          2797 non-null   int64 
 3   listed_count           2797 non-null   int64 
 4   favourites_count       2797 non-null   int64 
 5   verified               2797 non-null   bool  
 6   statuses_count         2797 non-null   int64 
 7   lang                   2797 non-null   object
 8   default_profile        2797 non-null   bool  
 9   default_profile_image  2797 non-null   bool  
 10  has_extended_profile   2698 non-null   object
 11  bot                    2797 non-null   int64 
dtypes: bool(3), int64(6), object(3)
memory usage: 205.0+ KB


In [213]:
twitter.verified[0]

False

In [214]:
# Answer below:
# Cast the boolean columns directly to 0/1 integers
twitter['verified'] = twitter['verified'].astype(int)
twitter['default_profile'] = twitter['default_profile'].astype(int)
twitter['default_profile_image'] = twitter['default_profile_image'].astype(int)


Create dummy variables out of the relevant object columns. Take care when converting columns that may be incorrectly classified as object.

In [215]:
# Answer below:
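twitter['has_extended_profile'] = pd.get_dummies(twitter['has_extended_profile'], drop_first=True)
# Note: `lang` appears to have more than two categories, so get_dummies returns several
# indicator columns; assigning the result to the single `lang` column keeps only one of them.
twitter['lang'] = pd.get_dummies(twitter['lang'], drop_first=True)
twitter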

Unnamed: 0,description,followers_count,friends_count,listed_count,favourites_count,verified,statuses_count,lang,default_profile,default_profile_image,has_extended_profile,bot
0,rare strong pokžmon houston tx see pokžmon ht...,1291,0,10,0,0,78554,1,1,0,0,1
1,late 2014 socium marketplac make shop fundamen...,1,349,0,38,0,31,0,1,0,0,1
2,inspir smart funni folk @replyal @gimletmedia ...,1086,0,14,0,0,713,0,1,0,0,1
3,,33,0,8,0,0,676,0,1,1,0,1
4,inspir cook everywher sinc 1956,11,745,0,146,0,185,0,0,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...
2792,twitter cmo favorit titl mama never ever dull ...,18998,2005,425,2503,0,3498,0,0,0,1,0
2793,live brooklyn i m bike messeng play band i m u...,32,54,0,1,0,97,0,1,0,0,0
2794,astrophysicist,45044433,7451,68157,24,1,9606,0,0,0,0,0
2795,i m quit mind actual peopl continu find amus c...,16,64,1,15,0,62,0,0,0,1,0
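
When a categorical column has more than two levels, one common approach is to expand it on the whole frame so that every indicator column is kept. A minimal sketch on a made-up frame follows; `demo` is hypothetical, and treating a missing `has_extended_profile` as 0 is an assumption made only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `twitter`.
demo = pd.DataFrame({
    "lang": ["en", "es", "fr", "en"],
    "has_extended_profile": [True, False, np.nan, True],
    "followers_count": [10, 20, 30, 40],
})

# Expand `lang` into one indicator column per category (minus the first),
# replacing the original column instead of assigning into it.
encoded = pd.get_dummies(demo, columns=["lang"], drop_first=True, dtype=int)

# True/False stored as object (because of missing values) maps to 1/0;
# missing values become 0 here, which is an assumption of this sketch.
encoded["has_extended_profile"] = (
    demo["has_extended_profile"].map({True: 1, False: 0}).fillna(0).astype(int)
)
```

Applied to `twitter`, this would change the number of numeric input columns, so the input shape of the numeric branch would need to follow `X.shape[1]`, as it already does in the model code.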


Min-max scale the data describing each user (do not min-max scale the word embeddings).

In [216]:
# Answer below:
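from sklearn.preprocessing import MinMaxScaler

# Note: only `description` is dropped here, so the target column `bot` stays in X
# (11 columns) and is visible to the model as an input.
X = twitter.drop('description', axis=1)
scale = MinMaxScaler()
X = scale.fit_transform(X)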
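
One way to keep the label `bot` out of the scaled inputs (the `drop` call above removes only `description`) is to split it off before scaling. The sketch below does this on a made-up frame and writes out the min-max formula with pandas and numpy instead of calling `MinMaxScaler`; `demo`, `features`, and `X_demo` are hypothetical names.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the preprocessed `twitter`.
demo = pd.DataFrame({
    "description": ["a", "b", "c"],
    "followers_count": [0, 50, 100],
    "friends_count": [10, 10, 30],
    "bot": [1, 0, 1],
})

# Separate the label first so it is never part of the model inputs.
y = demo["bot"].to_numpy()
features = demo.drop(["description", "bot"], axis=1).astype(float)

# Min-max scaling: (x - min) / (max - min), column by column.
col_min = features.min(axis=0)
col_range = (features.max(axis=0) - col_min).replace(0, 1.0)  # avoid 0/0
X_demo = ((features - col_min) / col_range).to_numpy()
```

This is the same formula `MinMaxScaler` applies with its default range of 0 to 1.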


Now we'll create the two branches. Create a model for the numeric data that consists of three layers: an input layer and two dense hidden layers of size 32.

In [217]:
# Answer below:
from tensorflow.keras.layers import Dense, Input, Flatten
input_layer1 = Input(shape=(X.shape[1],),)
h1 = Dense(32,activation='relu')(input_layer1)
h2 = Dense(32,activation='relu')(h1)
flat1 = Flatten()(h2)



Create the second branch of the model using the encoded words. This branch will consist of 4 layers: an input layer, an embedding layer returning data of dimension 100, an LSTM layer of unit size 32, and a dense layer of unit size 32.

In [218]:
# Answer below:
from tensorflow.keras.layers import Embedding,LSTM
input_layer2 = Input(shape=(desc.shape[1],))
h1 = Embedding(np.max(desc)+1,100)(input_layer2)
h2 = LSTM(32)(h1)
h3 = Dense(32, activation='relu')(h2)
flat2 = Flatten()(h3)

Merge the two models using the `concatenate` function (merge the two final dense layers in each branch) and create an output dense layer.

In [219]:
# Answer below:
from tensorflow.keras.layers import concatenate

merge = concatenate([flat1, flat2])
output = Dense(1, activation='sigmoid')(merge)
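
To make the shapes concrete, here is a numpy sketch of what the merge and the output layer compute for a small batch. The random arrays stand in for the two 32-unit branch outputs and the weights are arbitrary, so this only illustrates shapes and the sigmoid, not the trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 4

# Stand-ins for the two 32-unit branch outputs (random values, illustration only).
branch_numeric = rng.normal(size=(batch, 32))
branch_text = rng.normal(size=(batch, 32))

# Concatenation joins the feature axes: (batch, 32) + (batch, 32) -> (batch, 64).
merged = np.concatenate([branch_numeric, branch_text], axis=1)

# A single-unit dense layer with sigmoid: one weight per merged feature plus a bias.
weights = rng.normal(scale=0.1, size=(64, 1))
bias = np.zeros(1)
probabilities = 1.0 / (1.0 + np.exp(-(merged @ weights + bias)))
```

Each row of `probabilities` is a single value between 0 and 1, read here as the predicted probability that the user is a bot.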



Create a model using the two inputs and the single output, then print the summary.

In [220]:
# Answer below: 
from tensorflow.keras.models import Model

model = Model(inputs=[input_layer1, input_layer2], outputs=output)
model.summary()


Model: "model_2"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_2 (InputLayer)            [(None, 66)]         0                                            
__________________________________________________________________________________________________
input_1 (InputLayer)            [(None, 11)]         0                                            
__________________________________________________________________________________________________
embedding_4 (Embedding)         (None, 66, 100)      11800       input_2[0][0]                    
__________________________________________________________________________________________________
dense_16 (Dense)                (None, 32)           384         input_1[0][0]                    
____________________________________________________________________________________________
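
The parameter counts in the summary can be checked by hand. A short sketch follows, assuming the usual counting rules: vocabulary size times dimension for an embedding, inputs times units plus biases for a dense layer, and four gates with input weights, recurrent weights, and biases for an LSTM. The vocabulary size of 118 is inferred from 11800 / 100, and the LSTM count does not appear in the truncated summary above, so it is a derived value rather than recorded output.

```python
def embedding_params(vocab_size, output_dim):
    """One vector of length `output_dim` per vocabulary index."""
    return vocab_size * output_dim

def dense_params(input_dim, units):
    """One weight per input-unit pair plus one bias per unit."""
    return input_dim * units + units

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights, and biases."""
    return 4 * (units * (input_dim + units) + units)

embedding_count = embedding_params(118, 100)  # 118 inferred from the summary
dense_count = dense_params(11, 32)            # first dense layer on 11 inputs
lstm_count = lstm_params(100, 32)             # derived, not shown in the summary
```

The first two values match the 11800 and 384 printed in the summary.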

Compile and fit the model using the appropriate optimizer, loss, and metrics. Train the model for 10 epochs with a batch size of 128.

In [223]:
# Answer below:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# The assignment asks for 10 epochs; 50 are run here, which seems like overkill since 10 were enough.
history = model.fit(x=[X, desc], y=twitter.bot, epochs=50, batch_size=128)

Epoch 1/50
...
Epoch 50/50
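
For a single sigmoid output, binary cross-entropy is the usual loss and accuracy a natural metric. The numpy sketch below shows what the two quantities measure; the function names and the `eps` clipping value are illustrative assumptions, not the Keras internals.

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of the true labels under the predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0) for predictions of exactly 0 or 1.
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(losses))

def accuracy(y_true, y_prob, threshold=0.5):
    """Share of predictions on the correct side of the threshold."""
    predicted = (np.asarray(y_prob) >= threshold).astype(int)
    return float(np.mean(predicted == np.asarray(y_true)))

uninformed_loss = binary_crossentropy([1, 0], [0.5, 0.5])
example_accuracy = accuracy([1, 0, 1], [0.9, 0.2, 0.4])
```

A model that always predicts 0.5 scores a loss of ln 2, which gives a baseline for judging the per-epoch loss values.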
