## Day 85 Lecture 1 Assignment

In this assignment, we will learn how to use additional layer types to improve model performance.

In [1]:
import numpy as np
import pandas as pd

We will explore a dataset containing information about Twitter users and detect whether or not each user is a bot.

In [2]:
twitter = pd.read_csv('https://tf-assets-prod.s3.amazonaws.com/tf-curric/data-science/training_data_2_csv_UTF.csv')

In [3]:
twitter.bot.value_counts()

0    1476
1    1321
Name: bot, dtype: int64

In [4]:
twitter.head()

Unnamed: 0,id,id_str,screen_name,location,description,url,followers_count,friends_count,listed_count,created_at,favourites_count,verified,statuses_count,lang,status,default_profile,default_profile_image,has_extended_profile,name,bot
0,8.16e+17,"""815745789754417152""","""HoustonPokeMap""","""Houston, TX""","""Rare and strong PokŽmon in Houston, TX. See m...","""https://t.co/dnWuDbFRkt""",1291,0,10,"""Mon Jan 02 02:25:26 +0000 2017""",0,False,78554,"""en""","{\r ""created_at"": ""Sun Mar 12 15:44:04 +0...",True,False,False,"""Houston PokŽ Alert""",1
1,4843621000.0,4843621225,kernyeahx,"Templeville town, MD, USA",From late 2014 Socium Marketplace will make sh...,,1,349,0,2/1/2016 7:37,38,False,31,en,,True,False,False,Keri Nelson,1
2,4303727000.0,4303727112,mattlieberisbot,,"Inspired by the smart, funny folks at @replyal...",https://t.co/P1e1o0m4KC,1086,0,14,Fri Nov 20 18:53:22 +0000 2015,0,False,713,en,"{'retweeted': False, 'is_quote_status': False,...",True,False,False,Matt Lieber Is Bot,1
3,3063139000.0,3063139353,sc_papers,,,,33,0,8,2/25/2015 20:11,0,False,676,en,Construction of human anti-tetanus single-chai...,True,True,False,single cell papers,1
4,2955142000.0,2955142070,lucarivera16,"Dublin, United States",Inspiring cooks everywhere since 1956.,,11,745,0,1/1/2015 17:44,146,False,185,en,,False,False,False,lucarivera16,1


Start by getting rid of all columns that are not useful.

In [5]:
no_good = ['id','id_str','screen_name','location','url','created_at','lang','status','name']
twitter.drop(columns=no_good, inplace=True)
twitter.head()

Unnamed: 0,description,followers_count,friends_count,listed_count,favourites_count,verified,statuses_count,default_profile,default_profile_image,has_extended_profile,bot
0,"""Rare and strong PokŽmon in Houston, TX. See m...",1291,0,10,0,False,78554,True,False,False,1
1,From late 2014 Socium Marketplace will make sh...,1,349,0,38,False,31,True,False,False,1
2,"Inspired by the smart, funny folks at @replyal...",1086,0,14,0,False,713,True,False,False,1
3,,33,0,8,0,False,676,True,True,False,1
4,Inspiring cooks everywhere since 1956.,11,745,0,146,False,185,False,False,False,1


Next, get rid of all columns that contain more than 30% missing data. After that, remove all rows containing at least one missing observation.

In [6]:
# Answer below:
twitter.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2797 entries, 0 to 2796
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   description            2394 non-null   object
 1   followers_count        2797 non-null   int64 
 2   friends_count          2797 non-null   int64 
 3   listed_count           2797 non-null   int64 
 4   favourites_count       2797 non-null   int64 
 5   verified               2797 non-null   bool  
 6   statuses_count         2797 non-null   int64 
 7   default_profile        2797 non-null   bool  
 8   default_profile_image  2797 non-null   bool  
 9   has_extended_profile   2698 non-null   object
 10  bot                    2797 non-null   int64 
dtypes: bool(3), int64(6), object(2)
memory usage: 183.1+ KB


In [7]:
twitter.dropna(inplace=True)
twitter.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2355 entries, 0 to 2796
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   description            2355 non-null   object
 1   followers_count        2355 non-null   int64 
 2   friends_count          2355 non-null   int64 
 3   listed_count           2355 non-null   int64 
 4   favourites_count       2355 non-null   int64 
 5   verified               2355 non-null   bool  
 6   statuses_count         2355 non-null   int64 
 7   default_profile        2355 non-null   bool  
 8   default_profile_image  2355 non-null   bool  
 9   has_extended_profile   2355 non-null   object
 10  bot                    2355 non-null   int64 
dtypes: bool(3), int64(6), object(2)
memory usage: 172.5+ KB
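
None of the columns here exceeds the 30% threshold (`description` is the sparsest, at roughly 14% missing), so `dropna` alone is enough for this dataset. For a frame where some columns do cross the threshold, the two-step cleanup described above might look like the following minimal sketch. The toy frame `demo` and its column names are hypothetical stand-ins, not part of the Twitter data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `twitter`
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "b": [1, np.nan, np.nan, np.nan, np.nan, 6, 7, 8, 9, 10],  # 40% missing
    "c": [1, 2, np.nan, 4, 5, 6, 7, 8, 9, 10],                 # 10% missing
})

# Step 1: share of missing values per column; keep columns at or below 30%
missing_share = demo.isna().mean()
keep = missing_share[missing_share <= 0.30].index

# Step 2: among the surviving columns, drop rows with any missing value
cleaned = demo[keep].dropna()
print(cleaned.shape)
```

Dropping columns before rows matters: a sparse column removed first can no longer cause otherwise complete rows to be discarded.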


Now we will reuse the text-normalization functions from a previous assignment.

In [8]:
from nltk.corpus import stopwords
import re
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

def remove_stopwords(input_text):
    stopwords_list = stopwords.words('english')
    # Some words which might indicate a certain sentiment are kept via a whitelist
    whitelist = ["n't", "not", "no"]
    words = input_text.split()
    clean_words = [word for word in words if (word not in stopwords_list or word in whitelist) and len(word) > 1]
    return " ".join(clean_words)

def stem_list(word_list):
    stemmed = []
    for word in word_list:
        stemmedword = stemmer.stem(word)
        stemmed.append(stemmedword)
    return stemmed

def normalize(terms):
    terms = terms.lower()
    terms = remove_stopwords(terms)
    word_delimiters = u'[\\[\\]\n.!?,;:\t\\-\\"\\(\\)\\\'\u2019\u2013 ]'
    term_list = re.split(word_delimiters, terms)
    trimmed = [x.rstrip() for x in term_list]
    stemmed = stem_list(trimmed)
    space = ' '
    normed = space.join(stemmed)
    normed = normed.replace('  ', ' ')
    return normed

In [9]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

We will create two branches: one will process the text in the description column, and the other will process all remaining columns. First, build a numpy array of encoded descriptions: normalize each description, one-hot encode the text, pad each row to a common length, and collect the result in a numpy array.

In [10]:
# Answer below:
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [11]:
twitter['norm'] = twitter['description'].apply(normalize)

In [12]:
vocab = len(set(twitter['norm'].str.split().sum()))

In [13]:
twitter['coded'] = twitter['norm'].apply(one_hot, args=[vocab])

In [14]:
var = pad_sequences(twitter['coded'])
var.shape

(2355, 66)
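
As a rough illustration of what the two Keras helpers do: `one_hot` hashes each word to an integer between 1 and `n - 1` (so distinct words can collide), and `pad_sequences` by default left-pads shorter rows with zeros up to the length of the longest row. The sketch below mimics that behavior in plain Python and NumPy. `hash_encode` and `left_pad` are hypothetical helper names, and the MD5-based hash is a stand-in chosen for reproducibility, not the hash Keras uses.

```python
import hashlib
import numpy as np

def hash_encode(text, n):
    """Map each whitespace-separated word to an integer in [1, n - 1]."""
    return [
        int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % (n - 1) + 1
        for word in text.split()
    ]

def left_pad(seqs, value=0):
    """Left-pad integer sequences with `value` to the longest length."""
    maxlen = max(len(s) for s in seqs)
    out = np.full((len(seqs), maxlen), value, dtype="int32")
    for i, s in enumerate(seqs):
        if s:  # guard: out[i, -0:] would overwrite the whole row
            out[i, -len(s):] = s
    return out

n = 50  # hypothetical vocabulary size
docs = ["rare strong pokemon", "inspir cook"]
encoded = [hash_encode(d, n) for d in docs]
padded = left_pad(encoded)
print(padded.shape)
```

Because index 0 is reserved for padding, the embedding layer later needs an input dimension of at least the largest index plus one.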

In [38]:
twitter.drop(columns=['norm','coded'], inplace=True)

Convert all boolean variables to numeric (0 for False and 1 for True).

In [15]:
# Answer below:
#twitter['verified'] = twitter['verified'].map(lambda x: 1 if x == True else 0)
#twitter['default_profile'] = twitter['default_profile'].map(lambda x: 1 if x == True else 0)
#twitter['default_profile_image'] = twitter['default_profile_image'].map(lambda x: 1 if x == True else 0)
#twitter.info()

In [39]:
# Answer below:
twitter['verified'] = twitter['verified'] * 1
twitter['default_profile'] = twitter['default_profile'] * 1
twitter['default_profile_image'] = twitter['default_profile_image'] * 1
twitter['has_extended_profile'] = twitter['has_extended_profile'] * 1
twitter['has_extended_profile'] = pd.to_numeric(twitter['has_extended_profile'])
twitter.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2355 entries, 0 to 2796
Data columns (total 11 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   description            2355 non-null   object
 1   followers_count        2355 non-null   int64 
 2   friends_count          2355 non-null   int64 
 3   listed_count           2355 non-null   int64 
 4   favourites_count       2355 non-null   int64 
 5   verified               2355 non-null   int64 
 6   statuses_count         2355 non-null   int64 
 7   default_profile        2355 non-null   int64 
 8   default_profile_image  2355 non-null   int64 
 9   has_extended_profile   2355 non-null   int64 
 10  bot                    2355 non-null   int64 
dtypes: int64(10), object(1)
memory usage: 220.8+ KB
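
Multiplying by 1 works, but an explicit cast states the intent more directly. Below is a minimal sketch on a hypothetical toy frame `demo`, where one column has a true `bool` dtype and the other holds Python booleans under `object` dtype (which appears to be the situation for `has_extended_profile` after `dropna`).

```python
import pandas as pd

# Hypothetical toy frame: one real bool column, one object column of booleans
demo = pd.DataFrame({
    "verified": [True, False, True],
    "has_extended_profile": pd.Series([True, False, False], dtype=object),
})

bool_like = ["verified", "has_extended_profile"]
for col in bool_like:
    # Cast to bool first so object columns are handled, then to 0/1 integers
    demo[col] = demo[col].astype(bool).astype("int64")

print(demo.dtypes)
```

This cast is only safe when the column really holds booleans: a column of strings such as `'False'` would become all ones, because any non-empty string is truthy.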


Create dummy variables out of the relevant object columns. Take caution when converting columns that may be incorrectly classified as object.

In [37]:
# Answer below:
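
In this notebook the only remaining `object` column is `description`, which is handled by the text branch, so there is nothing left to encode. For a dataset that does keep a categorical column, the step could be sketched as follows. The frame `demo` and its `lang` column are hypothetical, used only to illustrate checking dtypes before calling `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical toy frame with one categorical column and one mis-typed boolean
demo = pd.DataFrame({
    "lang": pd.Series(["en", "es", "en"], dtype=object),
    "has_extended_profile": pd.Series([True, False, True], dtype=object),
    "followers_count": [10, 5, 7],
})

# The boolean column is only *stored* as object: convert it to numeric first
# so that it is not expanded into dummy columns by mistake.
demo["has_extended_profile"] = demo["has_extended_profile"].astype(bool).astype("int64")

# Whatever is still object-typed is treated as categorical
object_cols = demo.select_dtypes(include="object").columns.tolist()
dummies = pd.get_dummies(demo, columns=object_cols, dtype="int64")
print(dummies.columns.tolist())
```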

Min-max scale the data describing each user (do not min-max scale the word embeddings).

In [40]:
from sklearn.preprocessing import MinMaxScaler

In [46]:
# Answer below:
X = twitter.drop(columns=['description', 'bot'])

scaler = MinMaxScaler()
scaled = scaler.fit_transform(X)
scaled[0]  # first scaled row

array([1.34030216e-05, 0.00000000e+00, 1.61204260e-05, 0.00000000e+00,
       0.00000000e+00, 1.14454095e-02, 1.00000000e+00, 0.00000000e+00,
       0.00000000e+00])
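
For reference, min-max scaling maps each column to the range [0, 1] via (x - min) / (max - min), computed column by column. The NumPy sketch below reproduces that arithmetic on a small hypothetical array `X_demo`. With default settings `MinMaxScaler` should give the same values, though the guard for constant columns is this sketch's own choice.

```python
import numpy as np

# Hypothetical feature matrix: 3 users, 2 numeric features on different scales
X_demo = np.array([
    [1.0, 200.0],
    [3.0, 400.0],
    [5.0, 800.0],
])

col_min = X_demo.min(axis=0)
col_max = X_demo.max(axis=0)

# Avoid dividing by zero for constant columns (assumption of this sketch)
span = np.where(col_max > col_min, col_max - col_min, 1.0)

scaled_demo = (X_demo - col_min) / span
print(scaled_demo)
```

Very small scaled values, like those in the row printed above, are what this formula produces when a column's maximum is orders of magnitude larger than a typical entry.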

Now we'll create the two branches. Create a model for the numeric data that consists of an input layer and two hidden dense layers of size 32, followed by a dense layer of size 32 that serves as the branch output.

In [50]:
X.shape

(2355, 9)

In [49]:
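# Import from tensorflow.keras so these layers match the other imports in this notebook
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model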

In [51]:
# Answer below:
input1 = Input(shape=(9,)) 

h1 = Dense(32, activation='relu')(input1)
h2 = Dense(32, activation='relu')(h1)
# ReLU rather than softmax: this is an intermediate feature layer, not a class-probability output
output1 = Dense(32, activation='relu')(h2)

In [None]:
model1 = Model(inputs=input1, outputs=output1)
model1.summary()

Create the second branch of the model using the encoded words. This branch will consist of 4 layers: An input layer, an embedding layer returning data of dimension 100, an LSTM layer of unit size 32 and a dense layer of unit size 32. 

In [54]:
var.shape

(2355, 66)

In [59]:
from tensorflow.keras.layers import Embedding, LSTM

In [58]:
max_words = np.max(var)+1
max_words

8020

In [60]:
# Answer below:
input2 = Input(shape=(66,))
emb = Embedding(max_words, 100)(input2)
lst = LSTM(32, activation='relu')(emb)
# ReLU rather than softmax: this is an intermediate feature layer, not a class-probability output
output2 = Dense(32, activation='relu')(lst)
model2 = Model(inputs=input2, outputs=output2)
model2.summary()

Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_2 (InputLayer)         [(None, 66)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 66, 100)           802000    
_________________________________________________________________
lstm (LSTM)                  (None, 32)                17024     
_________________________________________________________________
dense_3 (Dense)              (None, 32)                1056      
Total params: 820,080
Trainable params: 820,080
Non-trainable params: 0
_________________________________________________________________
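
The parameter counts in the summary can be checked by hand. The sketch below uses the standard formulas for each layer type; the helper function names are hypothetical and exist only for this illustration.

```python
def embedding_params(vocab_size, dim):
    # One `dim`-sized vector per vocabulary index
    return vocab_size * dim

def lstm_params(n_in, units):
    # Four gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (n_in + units) + units)

def dense_params(n_in, n_out):
    # Weight matrix plus one bias per output unit
    return n_in * n_out + n_out

emb_count = embedding_params(8020, 100)   # max_words x embedding dimension
lstm_count = lstm_params(100, 32)         # embedding dimension -> 32 units
dense_count = dense_params(32, 32)        # LSTM output -> 32 units
total = emb_count + lstm_count + dense_count
print(emb_count, lstm_count, dense_count, total)
```

The same dense formula gives 9 * 32 + 32 = 320 for the first dense layer of the numeric branch.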


Merge the two models using the `concatenate` function (merge the two final dense layers in each branch) and create an output dense layer.

In [61]:
from tensorflow.keras.layers import concatenate

In [62]:
# Answer below:
merge = concatenate([output1, output2])
output = Dense(1, activation='sigmoid')(merge)
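
To make the shapes concrete: concatenating two branch outputs of width 32 along the feature axis gives a vector of width 64 per sample, and the final dense layer with a sigmoid turns that into a single value between 0 and 1. The NumPy sketch below walks through this with random stand-in values. The arrays and weights are hypothetical, not the trained model's.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 4

# Stand-ins for the two branch outputs, each of shape (batch, 32)
branch1 = rng.random((batch, 32))
branch2 = rng.random((batch, 32))

# Concatenate along the last (feature) axis -> (batch, 64)
merged = np.concatenate([branch1, branch2], axis=-1)

# A dense layer with one unit: 64 weights + 1 bias, then a sigmoid
W = rng.normal(size=(64, 1))
b = np.zeros(1)
probs = 1.0 / (1.0 + np.exp(-(merged @ W + b)))

print(merged.shape, probs.shape)
```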

Create a model using the two inputs and the single output, and print the summary.

In [64]:
from tensorflow.keras.utils import plot_model

In [65]:
# Answer below: 
model = Model(inputs=[input1, input2], outputs=output)
plot_model(model, to_file='model.png')
model.summary()

Model: "model_4"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
input_1 (InputLayer)            [(None, 9)]          0                                            
__________________________________________________________________________________________________
input_2 (InputLayer)            [(None, 66)]         0                                            
__________________________________________________________________________________________________
dense (Dense)                   (None, 32)           320         input_1[0][0]                    
__________________________________________________________________________________________________
embedding (Embedding)           (None, 66, 100)      802000      input_2[0][0]                    
____________________________________________________________________________________________

Compile and fit the model using the appropriate optimizer, loss, and metrics. Train the model for 10 epochs with a batch size of 128.

In [67]:
y = twitter.bot

In [69]:
# Answer below:
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Feed the min-max scaled features (`scaled`), not the raw `X`, to the numeric branch
history = model.fit(x=[scaled, var], y=y, batch_size=128, epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
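
Binary cross-entropy suits this task because the target is a single 0/1 label and the model outputs one sigmoid probability. A minimal NumPy sketch of the loss follows; the function name, the clipping constant `eps`, and the example arrays are assumptions of this illustration rather than values taken from the training run.

```python
import numpy as np

def binary_crossentropy(y_true, p_pred, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    p = np.clip(p_pred, eps, 1.0 - eps)
    return float(-np.mean(y_true * np.log(p) + (1.0 - y_true) * np.log(1.0 - p)))

y_true = np.array([1.0, 0.0, 1.0, 0.0])
confident = np.array([0.9, 0.1, 0.8, 0.2])  # mostly right and confident
unsure = np.array([0.5, 0.5, 0.5, 0.5])     # no information

loss_confident = binary_crossentropy(y_true, confident)
loss_unsure = binary_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

A model that always predicts 0.5 scores log(2), about 0.693, which is a useful baseline when reading the per-epoch loss.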
