# Transfer Learning CIFAR10

* Train a simple convnet on the first 5 output classes [0..4] of the CIFAR10 dataset.
* Freeze the convolutional layers and fine-tune the dense layers on the last 5 output classes [5..9].


### 1. Import the CIFAR10 data and create 2 datasets: one with classes 0 to 4 and the other with classes 5 to 9

In [2]:
import keras
import numpy as np
import keras.utils as np_utils
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten
from keras.layers import Conv2D, MaxPooling2D,Activation
from keras.datasets import cifar10

Using TensorFlow backend.


In [3]:
(x_train, y_train), (x_test, y_test) = cifar10.load_data()

print(y_train.shape[0],y_test.shape[0])

Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
50000 10000


In [4]:
x_train04 = []
x_test04 = []
x_train59 = []
x_test59 = []
y_train04 = []
y_test04 = []
y_train59 = []
y_test59 = []

for ix in range(y_train.shape[0]):
    if y_train[ix] < 5:
        # put data in set 1
        x_train04.append(x_train[ix]/255.0)
        y_train04.append(y_train[ix])
    else:
        # put data in set 2
        x_train59.append(x_train[ix]/255.0)
        y_train59.append(y_train[ix])

for ix in range(y_test.shape[0]):
    if y_test[ix] < 5:
        # put data in set 1
        x_test04.append(x_test[ix]/255.0)
        y_test04.append(y_test[ix])
    else:
        # put data in set 2
        x_test59.append(x_test[ix]/255.0)
        y_test59.append(y_test[ix])


x_train04 = np.asarray(x_train04).reshape((-1, 32, 32, 3))
x_test04 = np.asarray(x_test04).reshape((-1, 32, 32, 3))
x_train59 = np.asarray(x_train59).reshape((-1, 32, 32, 3))
x_test59 = np.asarray(x_test59).reshape((-1, 32, 32, 3))

print(x_train04.shape,x_test04.shape)
print(x_train59.shape ,x_test59.shape)

(25000, 32, 32, 3) (5000, 32, 32, 3)
(25000, 32, 32, 3) (5000, 32, 32, 3)
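The two loops above can also be written with a boolean mask. The following is a minimal sketch on synthetic data shaped like the output of `cifar10.load_data()`; the helper name `split_by_class` and the `float32` cast are assumptions made for illustration (the original cell divides by `255.0`, which yields `float64`).

```python
import numpy as np

# Synthetic stand-in for CIFAR-10: 20 random "images" and labels of shape (n, 1).
rng = np.random.default_rng(0)
x = rng.integers(0, 256, size=(20, 32, 32, 3), dtype=np.uint8)
y = np.arange(20).reshape(-1, 1) % 10  # two samples for each class 0..9


def split_by_class(x, y, threshold=5):
    """Split into (labels < threshold) and (labels >= threshold), scaling pixels to [0, 1]."""
    mask = y.ravel() < threshold              # one boolean per sample
    x_scaled = x.astype("float32") / 255.0    # float32 halves memory compared with float64
    return (x_scaled[mask], y[mask]), (x_scaled[~mask], y[~mask])


(x_low, y_low), (x_high, y_high) = split_by_class(x, y)
print(x_low.shape, x_high.shape)
```

Indexing with a mask avoids building Python lists of 25,000 arrays and removes the need for the final `reshape`.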


### 2. Use one-hot encoding to convert y_train and y_test to the required number of output classes

In [0]:
y_train04 = np_utils.to_categorical(np.asarray(y_train04),5)
y_test04 = np_utils.to_categorical(np.asarray(y_test04), 5)

y_train59= np.asarray(y_train59)-5
y_test59= np.asarray(y_test59)-5
y_train59 = np_utils.to_categorical(y_train59,5)
y_test59 = np_utils.to_categorical(y_test59, 5)
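The labels of the second dataset are shifted by 5 because a 5-column one-hot encoding expects class indices 0 to 4. The sketch below shows the idea with plain NumPy; `one_hot` is a hypothetical, simplified stand-in for `to_categorical`, not the Keras implementation.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Encode integer labels 0..num_classes-1 as rows with a single 1."""
    labels = np.asarray(labels).ravel().astype(int)
    encoded = np.zeros((labels.size, num_classes), dtype="float32")
    encoded[np.arange(labels.size), labels] = 1.0
    return encoded


y_high = np.array([[5], [7], [9]])   # original CIFAR-10 labels from the 5..9 subset
encoded = one_hot(y_high - 5, 5)     # shift to 0..4 before encoding
print(encoded)
```

Without the shift, label 9 would index column 9 of a 5-column matrix and fail.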

### 3. Build a sequential neural network model which can classify the classes 0 to 4 of CIFAR10 dataset with at least 80% accuracy on test data

In [6]:
model = Sequential()

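# Kernel size is passed as a (3, 3) tuple; in current Keras a third positional integer is read as strides
model.add(Conv2D(32, (3, 3), input_shape=(32, 32, 3), activation='relu'))
model.add(Conv2D(32, (3, 3), activation='relu'))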
model.add(MaxPooling2D(pool_size=(2,2)))

model.add(Dropout(0.25))

model.add(Conv2D(64, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Dropout(0.25))


model.add(Flatten())
#model.add(Dropout(0.25))


model.add(Dense(256))
model.add(Activation('relu'))

model.add(Dense(5))
model.add(Activation('softmax'))

model.summary()
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(x_train04, y_train04,
         epochs=10,
         batch_size=32,
         validation_data=(x_test04, y_test04))



_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d_1 (Conv2D)            (None, 30, 30, 32)        896       
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 28, 28, 32)        9248      
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 14, 14, 32)        0         
_________________________________________________________________
dropout_1 (Dropout)          (None, 14, 14, 32)        0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 14, 14, 64)        18496     
_________________________________________________________________
activation_1 (Activation)    (None, 14, 14, 64)        0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 12, 12, 64)        36928     
__________

<keras.callbacks.History at 0x7f320df75e10>
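The output shapes and parameter counts in the summary can be checked by hand. This is a small sketch of that arithmetic for 3x3 kernels; the helper names are made up for illustration. It also shows what a stride of 3 would do to the first layer, which is how current Keras would interpret a call such as `Conv2D(32, 3, 3)`.

```python
def conv_output_size(size, kernel, padding="valid", stride=1):
    """Spatial output size of a convolution along one dimension."""
    if padding == "same":
        return -(-size // stride)          # ceil(size / stride)
    return (size - kernel) // stride + 1   # 'valid': no padding


def conv_params(kernel, in_channels, filters):
    """Weights (kernel * kernel * in_channels per filter) plus one bias per filter."""
    return kernel * kernel * in_channels * filters + filters


size, channels = 32, 3
layers = [
    ("conv2d_1", 32, "valid"),
    ("conv2d_2", 32, "valid"),
    ("pool", None, None),        # 2x2 max pooling halves the spatial size
    ("conv2d_3", 64, "same"),
    ("conv2d_4", 64, "valid"),
]

report = []
for name, filters, padding in layers:
    if name == "pool":
        size //= 2
        continue
    params = conv_params(3, channels, filters)
    size = conv_output_size(size, 3, padding)
    channels = filters
    report.append((name, size, filters, params))

for row in report:
    print(row)

# With a stride of 3, the first layer would shrink 32x32 to 10x10 instead of 30x30.
strided_size = conv_output_size(32, 3, stride=3)
```

The computed values agree with the `Output Shape` and `Param #` columns shown in the summary for the four convolutional layers.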

### 4. In the model built above (for classifying classes 0-4 of CIFAR10), make only the dense layers trainable and the conv layers non-trainable

In [0]:
for layer in model.layers:
    if 'dense' not in layer.name:
        # Freeze every layer that is not a dense layer
        layer.trainable = False
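To see how much of the network still learns after freezing, the parameter counts can be derived from the layer definitions. This is a sketch of that arithmetic, not output from the notebook: the dense layer names `dense_1` and `dense_2` assume Keras' default naming, and the 6x6x64 flatten size follows from the 12x12x64 feature map being halved by the last 2x2 pooling layer.

```python
# Convolutional layers (frozen): 3x3 kernels, biases included.
conv_params = {"conv2d_1": 896, "conv2d_2": 9248, "conv2d_3": 18496, "conv2d_4": 36928}

# Flatten turns the pooled 6x6x64 feature map into a vector.
flatten_units = 6 * 6 * 64


def dense_params(inputs, units):
    """One weight per input-unit pair plus one bias per unit."""
    return inputs * units + units


dense = {
    "dense_1": dense_params(flatten_units, 256),  # assumed default layer name
    "dense_2": dense_params(256, 5),              # assumed default layer name
}

frozen = sum(conv_params.values())
trainable = sum(dense.values())
print("frozen:", frozen, "trainable:", trainable)
```

Keras applies a changed `trainable` flag when the model is compiled, so the model has to be compiled again before fitting on the new classes.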

### 5. Use the model trained on CIFAR10 (classes 0 to 4) to classify classes 5 to 9 of CIFAR10 (transfer learning)
Achieve an accuracy of more than 85% on test data

In [9]:
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(x_train59, y_train59,
          batch_size=32,
          epochs=10,
          verbose=1,
          validation_data=(x_test59, y_test59))

Train on 25000 samples, validate on 5000 samples
Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f320e886748>

## Sentiment analysis

The objective of the second problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Given the text of a tweet, we classify whether the sentiment directed at a particular mobile device is positive or negative.

### 6. Read the dataset (tweets.csv) and drop the NAs while reading it

In [10]:
from google.colab import files
uploaded = files.upload()

Saving tweets.csv to tweets.csv


In [0]:
import pandas as pd
import io

In [0]:
data = pd.read_csv(io.BytesIO(uploaded['tweets.csv']), encoding = "ISO-8859-1").dropna()

In [13]:
data.shape

(3291, 3)

In [14]:
data.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion


### Consider only rows having Positive emotion and Negative emotion and remove other rows from the dataframe.

In [0]:
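def preprocess(text):
    # Return the text unchanged if it is pure ASCII, otherwise an empty string
    try:
        if isinstance(text, bytes):
            return text.decode('ascii')
        text.encode('ascii')  # raises for non-ASCII characters or non-string values
        return text
    except Exception:
        return ""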

In [0]:
data = data[(data['is_there_an_emotion_directed_at_a_brand_or_product'] == 'Positive emotion') | (data['is_there_an_emotion_directed_at_a_brand_or_product'] == 'Negative emotion')]

In [19]:
data.shape

(3191, 3)

### 7. Represent text as numerical data using `CountVectorizer` and get the document term frequency matrix

#### Initialise CountVectorizer (the cells in this step use `cv` as the variable name).

In [0]:
from sklearn.feature_extraction.text import CountVectorizer

In [0]:
cv = CountVectorizer()

In [22]:
cv.fit(data['tweet_text'])

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=1,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [0]:
dtm = cv.transform(data['tweet_text'])

In [24]:
dtm.shape

(3191, 5648)

In [26]:
dtm1 = dtm.toarray()
dtm1.shape

(3191, 5648)
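What the document-term matrix contains can be illustrated without scikit-learn. The sketch below is a simplified re-implementation for illustration only: it lowercases the text, tokenizes with the default `token_pattern` shown in the `CountVectorizer` repr above, and counts tokens per document. The function name `build_dtm` and the toy documents are assumptions.

```python
import re
from collections import Counter

# Default token pattern from the CountVectorizer repr: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def build_dtm(docs):
    """Return (sorted vocabulary, dense count matrix as a list of rows)."""
    tokenized = [TOKEN_PATTERN.findall(doc.lower()) for doc in docs]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocabulary)}
    matrix = [[0] * len(vocabulary) for _ in docs]
    for row, toks in zip(matrix, tokenized):
        for tok, count in Counter(toks).items():
            row[index[tok]] = count
    return vocabulary, matrix


docs = ["Love my iPad, love it", "my iPhone battery died"]
vocabulary, matrix = build_dtm(docs)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is one document and each column one vocabulary term, which is why the matrix for the tweets has one row per tweet and one column per distinct token.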

### 8. Find the number of different words in the vocabulary

In [27]:
print(dtm)

  (0, 79)	1
  (0, 227)	1
  (0, 417)	2
  (0, 1290)	1
  (0, 2286)	1
  (0, 2426)	1
  (0, 2641)	1
  (0, 2663)	1
  (0, 3304)	1
  (0, 3706)	1
  (0, 4145)	1
  (0, 4619)	1
  (0, 4772)	1
  (0, 5014)	1
  (0, 5144)	1
  (0, 5230)	1
  (0, 5373)	1
  (0, 5416)	1
  (1, 152)	1
  (1, 281)	1
  (1, 347)	1
  (1, 367)	1
  (1, 417)	1
  (1, 475)	1
  (1, 1351)	1
  :	:
  (3189, 286)	1
  (3189, 302)	2
  (3189, 347)	1
  (3189, 798)	1
  (3189, 800)	1
  (3189, 1816)	1
  (3189, 1936)	2
  (3189, 2275)	1
  (3189, 2486)	1
  (3189, 2631)	1
  (3189, 2641)	1
  (3189, 2663)	1
  (3189, 3209)	1
  (3189, 3280)	1
  (3189, 4203)	1
  (3189, 4595)	1
  (3189, 4710)	1
  (3189, 4772)	1
  (3189, 4784)	1
  (3189, 5247)	1
  (3189, 5277)	1
  (3190, 1699)	1
  (3190, 2631)	1
  (3190, 2909)	1
  (3190, 4772)	1
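Printing the sparse matrix lists the non-zero `(row, column)` counts but does not state the vocabulary size directly. A minimal sketch of reading it from a fitted vectorizer, using toy documents and the hypothetical names `cv_demo` and `dtm_demo`:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents standing in for the tweets.
docs = ["I love my iPad", "my iPhone battery died", "love the new iPad app"]

cv_demo = CountVectorizer()
dtm_demo = cv_demo.fit_transform(docs)

# vocabulary_ maps each term to its column index, so its length is the vocabulary size.
vocab_size = len(cv_demo.vocabulary_)
print(vocab_size, dtm_demo.shape[1])
```

The vocabulary size always equals the number of columns of the document-term matrix, so for the tweets it matches the second entry of `dtm.shape`, 5648.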


#### Tip: To see all available attributes and methods of an object, use `dir`

In [28]:
dir(cv)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getstate__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__setstate__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_char_ngrams',
 '_char_wb_ngrams',
 '_check_stop_words_consistency',
 '_check_vocabulary',
 '_count_vocab',
 '_get_param_names',
 '_get_tags',
 '_limit_features',
 '_more_tags',
 '_sort_features',
 '_stop_words_id',
 '_validate_custom_analyzer',
 '_validate_params',
 '_validate_vocabulary',
 '_white_spaces',
 '_word_ngrams',
 'analyzer',
 'binary',
 'build_analyzer',
 'build_preprocessor',
 'build_tokenizer',
 'decode',
 'decode_error',
 'dtype',
 'encoding',
 'fit',
 'fit_transform',
 'fixed_vocabulary_',
 'get_feature_names',
 'get_params',
 'get_stop_words',
 'input',
 'inverse_transf

### Find out how many Positive and Negative emotions there are.

Hint: Use value_counts on that column

In [29]:
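# Call value_counts on the column itself; the top-level pd.value_counts is deprecated in recent pandas
data['is_there_an_emotion_directed_at_a_brand_or_product'].value_counts()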

Positive emotion    2672
Negative emotion     519
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
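The counts are heavily imbalanced, which matters when reading accuracy scores. As a rough reference point, the sketch below computes the accuracy of always predicting the majority class. It is computed on the full filtered dataset, not on a test split, so it is only an approximate baseline.

```python
# Class counts taken from the value_counts output above.
counts = {"Positive emotion": 2672, "Negative emotion": 519}

total = sum(counts.values())
majority_label, majority_count = max(counts.items(), key=lambda kv: kv[1])

# Accuracy obtained by predicting the majority label for every tweet.
baseline_accuracy = majority_count / total
print(f"{majority_label}: {baseline_accuracy:.4f}")
```

A classifier has to score clearly above this value before its accuracy says much about how well it recognises negative tweets.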

### Change the labels for Positive and Negative emotions to 1 and 0 respectively and store them in a new column named 'label' in the same dataframe

Hint: use map on that column and give labels

In [0]:
data['label'] = data.is_there_an_emotion_directed_at_a_brand_or_product.map({'Positive emotion':1, 'Negative emotion':0})

In [31]:
data.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,label
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,0
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,1
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,1
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,0
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,1


### 9. Define the feature set (independent variable, X) as the `tweet_text` column and the target (dependent variable, Y) as `label`, then divide into train and test datasets

In [0]:
X = data['tweet_text']
Y = data['label']

In [0]:
from sklearn.model_selection import train_test_split

In [0]:
x_train, x_test, y_train, y_test = train_test_split(X, Y, random_state=1)

In [35]:
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)

(2393,)
(798,)
(2393,)
(798,)
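The split sizes follow from scikit-learn's default test fraction of 0.25, which applies when neither `test_size` nor `train_size` is passed. A short sketch of the arithmetic:

```python
import math

n_samples = 3191        # rows left after keeping Positive and Negative emotions
test_fraction = 0.25    # scikit-learn default when no size is given

# The test size is rounded up; the remainder goes to the training set.
n_test = math.ceil(test_fraction * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```

Because the classes are imbalanced, passing `stratify=Y` to `train_test_split` would keep the class ratio equal in both parts; the original cell does not do this.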


## 10. **Predicting the sentiment:**


### Use Naive Bayes and Logistic Regression to predict the sentiment of the given text and compare their accuracy scores

In [0]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

In [0]:
vectorizer = CountVectorizer()

In [0]:
x_train_dtm = vectorizer.fit_transform(x_train)
x_test_dtm = vectorizer.transform(x_test)

In [39]:
print(x_train_dtm.shape)
print(x_test_dtm.shape)

(2393, 4919)
(798, 4919)


In [0]:
nb = MultinomialNB()

In [41]:
nb.fit(x_train_dtm,y_train)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [0]:
y_pred = nb.predict(x_test_dtm)

In [43]:
print(metrics.accuracy_score(y_test, y_pred))

0.8471177944862155


In [44]:
lr = LogisticRegression()
lr.fit(x_train_dtm,y_train)
y_pred_lr = lr.predict(x_test_dtm)



In [45]:
print(metrics.accuracy_score(y_test, y_pred_lr))

0.868421052631579


## 11. Create a function called `tokenize_test` which takes a CountVectorizer object as input and prints the number of features and the accuracy on the test data

In [0]:
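def tokenize_test(vect):
    # Fit the vectorizer on the training text only, then reuse its vocabulary for the test text
    x_train_dtm = vect.fit_transform(x_train)
    print('Features: ', x_train_dtm.shape[1])
    x_test_dtm = vect.transform(x_test)
    # Train Naive Bayes on the training matrix and score it on the test matrix
    nb = MultinomialNB()
    nb.fit(x_train_dtm, y_train)
    y_pred_class = nb.predict(x_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))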

### Create a CountVectorizer with `ngram_range=(1, 2)` and pass it to the `tokenize_test` function to print the accuracy score

In [47]:
# include 1-grams and 2-grams
vect = CountVectorizer(ngram_range=(1, 2))
tokenize_test(vect)

Features:  24855
Accuracy:  0.8558897243107769
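The jump in feature count comes from adding every pair of adjacent tokens as its own feature. The sketch below is a simplified illustration of how unigrams and bigrams are generated, not scikit-learn's code; `word_ngrams` is a hypothetical helper and the tokenizer mirrors the default `token_pattern`.

```python
import re

# Default CountVectorizer token pattern: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of the text for n in the inclusive range."""
    tokens = TOKEN_PATTERN.findall(text.lower())
    low, high = ngram_range
    grams = []
    for n in range(low, high + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


features = word_ngrams("Can not wait for iPad")
print(features)
```

Bigrams such as `can not` keep some word order that single tokens lose, at the cost of a much larger and sparser feature space.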


### 12. Create a CountVectorizer with `stop_words='english'` and pass it to the `tokenize_test` function to print the accuracy score

In [48]:
cvect = CountVectorizer(stop_words='english')
tokenize_test(cvect)

Features:  4681
Accuracy:  0.8533834586466166


### 13. Create a CountVectorizer with `stop_words='english'` and `max_features=300` and pass it to the `tokenize_test` function to print the accuracy score

In [49]:
cvect = CountVectorizer(stop_words='english',max_features=300)
tokenize_test(cvect)

Features:  300
Accuracy:  0.8107769423558897
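`max_features` keeps only the terms with the highest total frequency across the corpus. A simplified sketch of that selection, leaving out stop-word removal; `top_terms` is a hypothetical helper, and the way it orders terms with equal counts is not meant to match scikit-learn.

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def top_terms(docs, max_features):
    """Keep the max_features terms with the highest total count over all documents."""
    totals = Counter()
    for doc in docs:
        totals.update(TOKEN_PATTERN.findall(doc.lower()))
    return sorted(term for term, _ in totals.most_common(max_features))


docs = ["ipad ipad ipad app", "app battery app", "phone"]
kept = top_terms(docs, max_features=2)
print(kept)
```

Rare terms are dropped first, so a small limit such as 300 can discard words that carry sentiment, which is consistent with the lower accuracy reported above.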


### 14. Create a CountVectorizer with `ngram_range=(1, 2)` and `max_features=15000` and pass it to the `tokenize_test` function to print the accuracy score

In [50]:
cvect = CountVectorizer(ngram_range=(1,2),max_features=15000)
tokenize_test(cvect)

Features:  15000
Accuracy:  0.8533834586466166


### 15. Create a CountVectorizer with `ngram_range=(1, 2)` that includes only terms appearing in at least 2 documents (`min_df=2`) and pass it to the `tokenize_test` function to print the accuracy score

In [51]:
cvect = CountVectorizer(ngram_range=(1,2),min_df=2)
tokenize_test(cvect)

Features:  7764
Accuracy:  0.8583959899749374
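`min_df` counts the number of documents a term appears in, not its total number of occurrences. The sketch below illustrates the difference for single words; `filter_by_min_df` is a hypothetical helper, not scikit-learn's implementation.

```python
import re
from collections import Counter

# Default CountVectorizer token pattern: words of two or more characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def filter_by_min_df(docs, min_df=2):
    """Keep terms that occur in at least min_df different documents."""
    doc_freq = Counter()
    for doc in docs:
        # A set counts each term once per document, however often it repeats.
        doc_freq.update(set(TOKEN_PATTERN.findall(doc.lower())))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)


docs = ["great iPad app", "iPad battery died", "great great phone"]
kept = filter_by_min_df(docs)
print(kept)
```

Here `great` is kept because it appears in two documents, not because it appears three times in total.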
