# A New Approach With Better Assumptions 

Based on our previous analysis and findings, we concluded that tweets are a special kind of text that should be handled in a particular way. We also found that logistic regression gave the best results, so this new approach focuses mostly on it. Finally, our initial EDA suggested that word length matters; we investigate this further here by making it one of the predictors.  

## Import libraries

In [None]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import collections  as mc
%load_ext autoreload
%autoreload 2
import pandas as pd 
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
sns.set_style("white")

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
import spacy

from spacy import displacy
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

## Loading 

In [None]:

df_train = pd.read_csv("https://raw.githubusercontent.com/sarrab/DMML2020_COOP/main/data/cleaned_data.csv")
df_test=pd.read_csv("https://raw.githubusercontent.com/sarrab/DMML2020_COOP/main/data/test_data.csv")

## Cleaning- Rethought

This part covers our new approach to cleaning tweets. The related code is in the section of the same name in the notebook dedicated to cleaning.

### Feature creation: Word Average Length

This feature comes from our initial EDA, and we now take it into account in this part of our study. Instead of using the overall word length, we tried a different approach and averaged the length of the words in each tweet. This average word length then serves as one of the predictors, alongside the main text.

In [None]:

def avg_word_length(x):
    """Return the mean length of the whitespace-separated words in x."""
    x = x.split()
    return np.mean([len(i) for i in x])

df_train['avg_word_length'] = df_train['text'].apply(avg_word_length)
df_test['avg_word_length'] = df_test['text'].apply(avg_word_length)

df_train.head(3)

| Unnamed: 0.1 | Unnamed: 0 | id | keyword | location | text | target | avg_word_length |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3738 | destroyed | USA | black eye : a space battle occurred at star o ... | 0 | 4.8125 |
| 1 | 1 | 853 | bioterror | nolocation | <hashtag> world </hashtag> fedex no longer to ... | 0 | 5.235294 |
| 2 | 2 | 10540 | windstorm | Palm Beach County, FL | reality training : train falls off elevated tr... | 1 | 6.75 |
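
The helper above returns `nan` (with a NumPy warning) when a tweet contains no words, and raises an error when the value is not a string. A minimal sketch of a more defensive variant follows; the name `safe_avg_word_length`, the fallback value `0.0`, and the toy DataFrame are assumptions made for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def safe_avg_word_length(text):
    """Mean length of whitespace-separated words; 0.0 when there are none."""
    # Missing values (None / NaN) are not strings, so fall back to 0.0
    if not isinstance(text, str):
        return 0.0
    tokens = text.split()
    # An empty or whitespace-only tweet has no words to average
    if not tokens:
        return 0.0
    return float(np.mean([len(tok) for tok in tokens]))


# Toy data (hypothetical), including an empty and a missing tweet
toy = pd.DataFrame({"text": ["train falls off track", "", None]})
toy["avg_word_length"] = toy["text"].apply(safe_avg_word_length)
print(toy)
```

Whether `0.0` is the right fallback depends on the model; dropping such rows or imputing the column mean are equally reasonable choices.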


The new feature can either be kept as is or be further normalised. Both versions can be used for modelling, so let us also create a normalised one. 

**Normalising the new feature**

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
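# Fit the scaler on the training column only, then reuse it on the test column
# so that both columns are scaled with the same minimum and maximum
df_train['avg_word_length'] = scaler.fit_transform(df_train[['avg_word_length']])
df_test['avg_word_length'] = scaler.transform(df_test[['avg_word_length']])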

df_train

| Unnamed: 0.1 | Unnamed: 0 | id | keyword | location | text | target | avg_word_length |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 0 | 3738 | destroyed | USA | black eye : a space battle occurred at star o ... | 0 | 0.308399 |
| 1 | 1 | 853 | bioterror | nolocation | <hashtag> world </hashtag> fedex no longer to ... | 0 | 0.352787 |
| 2 | 2 | 10540 | windstorm | Palm Beach County, FL | reality training : train falls off elevated tr... | 1 | 0.511811 |
| 3 | 3 | 5988 | hazardous | USA | <hashtag> taiwan </hashtag> grace : expect tha... | 1 | 0.400767 |
| 4 | 4 | 6328 | hostage | Australia | new isis video : isis threatens to behead croa... | 1 | 0.462234 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 6466 | 6466 | 4377 | earthquake | ARGENTINA | <hashtag> earthquake </hashtag> <hashtag> sism... | 1 | 0.255050 |
| 6467 | 6467 | 3408 | derail | nolocation | <user> totally agree . she is and know what bi... | 0 | 0.174103 |
| 6468 | 6468 | 9794 | trapped | nolocation | hollywood movie about trapped miners released ... | 1 | 0.349081 |
| 6469 | 6469 | 10344 | weapons | BeirutToronto | friendly reminder that the only country to eve... | 1 | 0.280840 |


This is what the data looks like after normalisation.
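
How a test column ends up scaled depends on which data the scaler was fitted on. The toy sketch below uses made-up values (not the tweet data) to compare reusing a scaler fitted on the training column with fitting a fresh scaler on the test column.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical average word lengths, chosen only for illustration
train_vals = np.array([[2.0], [4.0], [6.0]])
test_vals = np.array([[3.0], [7.0]])

# Fit on the training column: learns min=2 and max=6
shared_scaler = MinMaxScaler()
train_scaled = shared_scaler.fit_transform(train_vals)

# Reuse the same min and max on the test column
test_scaled = shared_scaler.transform(test_vals)

# Fitting a separate scaler on the test column learns min=3 and max=7 instead
test_refit = MinMaxScaler().fit_transform(test_vals)

print(train_scaled.ravel())  # [0.  0.5 1. ]
print(test_scaled.ravel())   # [0.25 1.25]
print(test_refit.ravel())    # [0. 1.]
```

With the shared scaler, a test value of 3.0 maps to 0.25, the same position it would have among the training values. With the refitted scaler it maps to 0.0, so the same raw value means something different in the two sets.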

### Filling up the Missing Data With Tags

The code for this workaround is available in the notebook dedicated to cleaning.

### Ekphrasis: Dealing with Social Media Text 

Refer to the cleaning notebook.


## Feature Engineering

In our previous models, we considered all the columns important and built our models with them as predictors. In this new approach, we keep the same intuition. In our earlier analysis, whenever we used more than one predictor, we merged them and then computed a single combined Tf-Idf.  
Here, instead, we generate the Tf-Idf of each predictor separately, concatenate the results, and finally apply a dimensionality reduction technique, PCA, to the combined matrix. 

##### Feature Selection- Train set

In [None]:
# Select features
X_txt = df_train['text'].values # cleaned tweet text
X_loc = df_train['location'].values # tweet location
X_keyw = df_train['keyword'].values # tweet keyword
X_avg =  df_train['avg_word_length'].values # average word length per tweet

y_train_dp = df_train['target'].values # the labels we want to predict

##### Feature Selection- Test set

In [None]:
# Select features
X_txt_tst = df_test['text'].values # cleaned tweet text
X_loc_tst = df_test['location'].values # tweet location
X_keyw_tst = df_test['keyword'].values # tweet keyword
X_avg_tst =  df_test['avg_word_length'].values # average word length per tweet

#### Generating 3 Tf-Idf for the 3 columns


In [None]:
tfidf_vectorizer_txt = TfidfVectorizer(use_idf = True, max_df = 0.95)
train_txt_tfidf = tfidf_vectorizer_txt.fit_transform(X_txt)
tfidf_vectorizer_loc = TfidfVectorizer(use_idf = True, max_df = 0.95)
train_loc_tfidf = tfidf_vectorizer_loc.fit_transform(X_loc.astype('U'))
tfidf_vectorizer_kw = TfidfVectorizer(use_idf = True, max_df = 0.95)
train_kw_tfidf = tfidf_vectorizer_kw.fit_transform(X_keyw.astype('U'))


# Reuse the vectorizers fitted on the training set so that every test column
# refers to the same term (and the same idf weight) as in the training matrix
test_tfidf_txt = tfidf_vectorizer_txt.transform(X_txt_tst)
test_tfidf_loc = tfidf_vectorizer_loc.transform(X_loc_tst.astype('U'))
test_tfidf_kw = tfidf_vectorizer_kw.transform(X_keyw_tst.astype('U'))

We do not need to vectorize the average word length since it is already numeric.

#### Concatenating Them

In [None]:
import scipy
from scipy.sparse import csr_matrix
from scipy.sparse import hstack
from scipy.sparse import vstack

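# Difference in the number of rows between the text and location matrices
# (0 when both are built from the same DataFrame)
diff_n_clmns = train_txt_tfidf.shape[0] - train_loc_tfidf.shape[0]
diff_n_clmns_tst = test_tfidf_txt.shape[0] - test_tfidf_loc.shape[0]

# Pad the location matrices with empty rows so that the row counts match
trainX_tfidf = scipy.sparse.vstack((train_loc_tfidf, csr_matrix((diff_n_clmns, train_loc_tfidf.shape[1]))))
testX_tfidf = scipy.sparse.vstack((test_tfidf_loc, csr_matrix((diff_n_clmns_tst, test_tfidf_loc.shape[1]))))

# Place the text and location Tf-Idf matrices side by side
X_tfidf = hstack((train_txt_tfidf, trainX_tfidf))
X_tfidf_tst = hstack((test_tfidf_txt, testX_tfidf))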


In [None]:
# Same padding and stacking, this time for the keyword Tf-Idf matrices
diff_n_clmns_1 = X_tfidf.shape[0] - train_kw_tfidf.shape[0]
diff_n_clmns_1_tst = X_tfidf_tst.shape[0] - test_tfidf_kw.shape[0]

trainX_tfidf_1 = scipy.sparse.vstack((train_kw_tfidf, csr_matrix((diff_n_clmns_1, train_kw_tfidf.shape[1]))))
tstX_tfidf_1 = scipy.sparse.vstack((test_tfidf_kw, csr_matrix((diff_n_clmns_1_tst, test_tfidf_kw.shape[1]))))

X_tfidf_1 = hstack((X_tfidf, trainX_tfidf_1))
Xtst_tfidf_1 = hstack((X_tfidf_tst, tstX_tfidf_1))


# Append the average word length as a last column
# (despite the variable names, the result is a dense ndarray)
X_train_sparse = np.concatenate((X_tfidf_1.toarray(), X_avg[:,None]), axis=1)
X_test_sparse = np.concatenate((Xtst_tfidf_1.toarray(), X_avg_tst[:,None]), axis=1)

In [None]:
X_train_sparse.shape 

(6471, 15773)
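
The per-column vectorising and stacking above can also be expressed with scikit-learn's `ColumnTransformer`, which applies one transformer per column and concatenates the outputs. This is an alternative sketch, not the method used in this notebook; the toy DataFrames and the default `TfidfVectorizer` settings (without `max_df`) are assumptions made for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical miniature train and test sets with the same four columns
toy_train = pd.DataFrame({
    "text": ["forest fire near the town", "i love this song", "flood warning issued today"],
    "location": ["USA", "nolocation", "Australia"],
    "keyword": ["fire", "love", "flood"],
    "avg_word_length": [4.2, 3.25, 5.75],
})
toy_test = pd.DataFrame({
    "text": ["fire warning in the town"],
    "location": ["USA"],
    "keyword": ["fire"],
    "avg_word_length": [4.0],
})

# A single column name (string) passes a 1-D column to the vectorizer;
# a list of names passes a 2-D frame to the scaler
features = ColumnTransformer([
    ("txt", TfidfVectorizer(), "text"),
    ("loc", TfidfVectorizer(), "location"),
    ("kw", TfidfVectorizer(), "keyword"),
    ("avg", MinMaxScaler(), ["avg_word_length"]),
])

# Vocabularies and scaling range are learned from the training set only
X_toy_train = features.fit_transform(toy_train)
X_toy_test = features.transform(toy_test)

print(X_toy_train.shape, X_toy_test.shape)
```

Because the test set only goes through `transform`, both matrices have the same number of columns and each column refers to the same term in both.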

## Classify After Deep Text Preparation

### PCA and Logistic

Generating all these Tf-Idfs makes the feature space explode. As a workaround, we use PCA to reduce the dimensionality and then apply our logistic regression model.   
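
The number of principal components is a free choice. One common heuristic, sketched below on synthetic data, is to keep the smallest number of components whose cumulative explained variance reaches a target. The 95% threshold and the random matrix `X_demo` are assumptions for illustration and do not come from the tweet data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 50 features driven by 5 latent factors plus noise
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X_demo = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

# Fit PCA with all components and accumulate the explained variance ratios
pca_full = PCA().fit(X_demo)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Index of the first component at which the cumulative ratio reaches 0.95,
# converted to a component count
n_keep = int(np.searchsorted(cumulative, 0.95) + 1)
print("Components needed for 95% of the variance:", n_keep)
```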

##### Without Average Word Length

In this first model, we did not use the average word length feature as a predictor.  

In [None]:


from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import time
from sklearn.linear_model import LogisticRegressionCV

# Define Scaler
scaler = StandardScaler()
# transform data (.toarray() returns a plain ndarray; recent scikit-learn
# versions reject the np.matrix returned by .todense())
X_train_scaled = scaler.fit_transform(X_tfidf_1.toarray())
X_test_scaled = scaler.fit_transform(Xtst_tfidf_1.toarray())


# Define PCA
pca = PCA(n_components=300)
pca_1 = PCA(n_components=300)

# Reduce the scaled training matrix
X_train_pca = pca.fit_transform(X_train_scaled)
print('Shape after PCA: ', X_train_pca.shape)
print('Number of components: ', pca.n_components_)

x_tst_pca = pca_1.fit_transform(X_test_scaled)
print('Shape after PCA: ', x_tst_pca.shape)
print('Number of components: ', pca_1.n_components_)


# Fit model

log_reg_s = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000, random_state = 72)

start = time.time()
log_reg_s.fit(X_train_pca, y_train_dp)
end = time.time()
print('Time: ', round(end-start, 4))

y_pred = log_reg_s.predict(x_tst_pca)
y_pred


Shape after PCA:  (6471, 300)
Number of components:  300
Shape after PCA:  (1142, 300)
Number of components:  300
Time:  3.1056


array([0, 1, 0, ..., 1, 1, 0])
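
PCA components are learned from the data they are fitted on, so a PCA fitted separately on the test matrix is not guaranteed to produce the components the classifier was trained on. One way to keep both sets in the same space is to fit every transformer on the training data and only apply it to the test data. The sketch below illustrates this on made-up texts and labels; it is a toy example under those assumptions, not a rerun of the model above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature data set
train_texts = [
    "forest fire spreading near the town",
    "flood warning issued for the county",
    "earthquake damage reported downtown",
    "i love this new song",
    "what a great day at the beach",
    "my exam was a total disaster lol",
]
train_labels = np.array([1, 1, 1, 0, 0, 0])
test_texts = ["fire reported near the beach", "love this great song"]

vectorizer = TfidfVectorizer()
scaler = StandardScaler()
pca = PCA(n_components=3, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Training set: every step is fitted here
X_tr = vectorizer.fit_transform(train_texts).toarray()
X_tr = scaler.fit_transform(X_tr)
X_tr = pca.fit_transform(X_tr)
clf.fit(X_tr, train_labels)

# Test set: the fitted steps are only applied, never refitted
X_te = vectorizer.transform(test_texts).toarray()
X_te = scaler.transform(X_te)
X_te = pca.transform(X_te)
toy_pred = clf.predict(X_te)

print(X_tr.shape, X_te.shape, toy_pred)
```

Wrapping the same steps in a scikit-learn `Pipeline` would enforce this fit/transform split automatically.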

##### With All Predictors: Average Word Length Included

Here we use the concatenated Tf-Idf matrix, which includes the average word length.

In [None]:
# Define Scaler
scaler = StandardScaler()
# transform data (np.asarray guarantees a plain ndarray, which scikit-learn expects)
X_train_scaled = scaler.fit_transform(np.asarray(X_train_sparse))
X_test_scaled = scaler.fit_transform(np.asarray(X_test_sparse))


# Define PCA
pca = PCA(n_components=100)
pca_1 = PCA(n_components=100)

# Reduce the scaled training matrix
X_train_pca_1 = pca.fit_transform(X_train_scaled)
print('Shape after PCA: ', X_train_pca_1.shape)
print('Number of components: ', pca.n_components_)

x_tst_pca_1 = pca_1.fit_transform(X_test_scaled)
print('Shape after PCA: ', x_tst_pca_1.shape)
print('Number of components: ', pca_1.n_components_)


# Fit model

log_reg_s = LogisticRegressionCV(solver='saga', cv=5, max_iter=1200, random_state = 72)

start = time.time()
log_reg_s.fit(X_train_pca_1, y_train_dp)
end = time.time()
print('Time: ', round(end-start, 4))

y_pred_1 = log_reg_s.predict(x_tst_pca_1)
y_pred_1




Shape after PCA:  (6471, 100)
Number of components:  100
Shape after PCA:  (1142, 100)
Number of components:  100
Time:  89.7676


array([1, 0, 1, ..., 0, 0, 1])

##### With Text and Location Only: 150 Principal Components

In [None]:
# With only location and text
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Define Scaler
scaler = StandardScaler()
# transform data (.toarray() returns a plain ndarray; recent scikit-learn
# versions reject the np.matrix returned by .todense())
X_train_scaled_1 = scaler.fit_transform(X_tfidf.toarray())
X_test_scaled_1 = scaler.fit_transform(X_tfidf_tst.toarray())



# Define PCA
pca_2 = PCA(n_components=150)
pca_3 = PCA(n_components=150)

# Reduce the scaled training matrix
X_train_pca_2 = pca_2.fit_transform(X_train_scaled_1)
print('Shape after PCA: ', X_train_pca_2.shape)
print('Number of components: ', pca_2.n_components_)

x_tst_pca_2 = pca_3.fit_transform(X_test_scaled_1)
print('Shape after PCA: ', x_tst_pca_2.shape)
print('Number of components: ', pca_3.n_components_)


start = time.time()
log_reg_rdc_2 = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000, random_state = 72)

log_reg_rdc_2.fit(X_train_pca_2, y_train_dp)
end = time.time()
print('Time: ', round(end-start, 4))

y_pred_real_rdc_2 = log_reg_rdc_2.predict(x_tst_pca_2)
y_pred_real_rdc_2

Shape after PCA:  (6471, 150)
Number of components:  150
Shape after PCA:  (1142, 150)
Number of components:  150
Time:  2.168


array([0, 0, 0, ..., 0, 1, 0])

# CONCLUSIONS


Given a chance to cross-check the above predictions against the real labels, we strongly believe that our accuracies would have improved. 
That said, we could, for instance, have held out part of our training set to simulate a test set.
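
Holding out part of the training set, as suggested above, could look like the following sketch. It uses synthetic data from `make_classification` in place of the reduced tweet features; the 80/20 split, the stratification, and the variable names are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the training features and labels
X_demo, y_demo = make_classification(n_samples=500, n_features=20, random_state=72)

# Keep 20% of the labelled data aside as a simulated test set,
# preserving the class proportions in both parts
X_fit, X_val, y_fit, y_val = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=72
)

model = LogisticRegression(max_iter=1000).fit(X_fit, y_fit)
val_pred = model.predict(X_val)

# Because the held-out labels are known, the predictions can be scored
val_acc = accuracy_score(y_val, val_pred)
val_f1 = f1_score(y_val, val_pred)
print("Validation accuracy:", round(val_acc, 4))
print("Validation F1:", round(val_f1, 4))
```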

Another thing we realised is that the chosen feature, 'average word length', is not very meaningful: after averaging, both distributions do not appear significantly different. It would therefore have been better to use other features that we explored during our initial EDA, for instance 'total character' or 'word length'.  
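
Whether two distributions of a feature differ can be checked with summary statistics and a two-sample test. The sketch below assumes the two distributions are those of the two target classes and uses synthetic values; the Mann-Whitney U test is one possible choice, not necessarily the comparison made in the original EDA.

```python
import numpy as np
import pandas as pd
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(72)

# Synthetic average word lengths for two classes with slightly different means
toy = pd.DataFrame({
    "target": np.repeat([0, 1], 200),
    "avg_word_length": np.concatenate([
        rng.normal(4.8, 1.0, 200),
        rng.normal(5.0, 1.0, 200),
    ]),
})

# Per-class summary statistics (count, mean, std, quartiles)
summary = toy.groupby("target")["avg_word_length"].describe()
print(summary)

# Two-sided Mann-Whitney U test: a small p-value suggests the classes differ
stat, p_value = mannwhitneyu(
    toy.loc[toy["target"] == 0, "avg_word_length"],
    toy.loc[toy["target"] == 1, "avg_word_length"],
    alternative="two-sided",
)
print("U statistic:", stat, "p-value:", p_value)
```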