<a href="https://colab.research.google.com/github/spazznolo/goalie-consistency/blob/main/dsf_stc_exam.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [41]:
# Load data manipulation modules
import pandas as pd
import numpy as np

# Load data visualization modules
import matplotlib.pyplot as plt

# Load text manipulation modules
import re, string
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
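nltk.download('stopwords')
nltk.download('punkt')  # tokenizer models needed by nltk.word_tokenize; newer NLTK releases may require 'punkt_tab' instead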
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Load machine learning modules
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

# Load pre-trained word embedding modules
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer

# Load dataset
train = pd.read_csv('train.csv')

# **Exploring the Dataset**

Apart from the target, which is binary (disaster or not), all other relevant variables are text. The main variable is the tweet itself, stored in the 'text' column. There is also a keyword column (missing for only 0.8% of tweets) and a location column (available for 66.7% of tweets).

In [43]:
train.isnull().mean(axis = 0)

id          0.000000
keyword     0.008013
location    0.332720
text        0.000000
target      0.000000
dtype: float64

In [55]:
train['target'].mean()

0.4296597924602653

In [44]:
X = train.drop(['target'], axis=1)
y = train['target']

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=33)

# **Task 1: Bag of words model**

The first model, which is also the simplest, is built with the text processing method called Bag-of-Words, where each tweet is represented by the number of occurrences of each word it contains, ignoring word order.
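
As a minimal, from-scratch sketch of the Bag-of-Words idea (not the `CountVectorizer` implementation used in this notebook), the following counts word occurrences over a shared vocabulary. The helper name `bag_of_words` and the two example sentences are made up for illustration.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocabulary, count_vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    # The vocabulary is the sorted set of all words seen in any document
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One column per vocabulary word; word order in the document is ignored
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors

# Hypothetical example sentences
vocab, vectors = bag_of_words(["fire near the forest", "the fire the smoke"])
print(vocab)    # ['fire', 'forest', 'near', 'smoke', 'the']
print(vectors)  # [[1, 1, 1, 0, 1], [1, 0, 0, 1, 2]]
```

Each row is one document and each column one vocabulary word, which is the same shape of matrix a classifier is trained on.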

In [46]:
# Assign stemmer to object
ps = PorterStemmer()
# Build the stop word set once; calling stopwords.words() for every token is slow
stop_words = set(stopwords.words('english'))

# Function which preprocesses text in general format for all models
def preprocess_text(text):

    # removes HTML entities such as &amp;
    text = re.sub(r'&[a-zA-Z]+;?', '', text)
    # removes mentions
    text = re.sub(r'@[a-zA-Z_]+;?', '', text)
    # removes tokens that contain digits
    text = re.sub(r'\w*\d+\w*', '', text)
    # make all text lowercase
    text = text.lower()

    # tokenizes text
    tokens = nltk.word_tokenize(text)
    # removes stop words
    tokens = [i for i in tokens if i not in stop_words]
    # removes punctuation
    tokens = [i for i in tokens if i not in string.punctuation]
    # stems the remaining alphanumeric tokens
    tokens = [ps.stem(i) for i in tokens if i.isalnum()]

    return " ".join(tokens)


In [47]:
vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 1))

X_train_vec = vectorizer.fit_transform(X_train['text'].apply(preprocess_text))
X_train_vec = pd.DataFrame(X_train_vec.toarray(), columns=vectorizer.get_feature_names_out(), index=X_train.index)

X_val_vec = vectorizer.transform(X_val['text'].apply(preprocess_text))
X_val_vec = pd.DataFrame(X_val_vec.toarray(), columns=vectorizer.get_feature_names_out(), index=X_val.index)

In [48]:
# Create an instance of LogisticRegression classifier
lr = LogisticRegression(random_state=33)
lr.fit(X_train_vec, y_train)
y_pred = lr.predict(X_val_vec)
  
# Calculate performance metrics (precision, F1, accuracy)
precision = precision_score(y_val, y_pred)
fscore = f1_score(y_val, y_pred)
accuracy = accuracy_score(y_val, y_pred)

In [49]:
print(precision, fscore, accuracy)

0.7997159090909091 0.7292746113989638 0.7804621848739496
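
The three numbers printed above are precision, F1, and accuracy, in that order. As a rough sketch of how these metrics relate, the following computes them by hand from true and predicted labels. The helper `classification_metrics` and the toy labels are hypothetical and are not taken from the dataset.

```python
def classification_metrics(y_true, y_pred):
    """Compute precision, recall, F1 and accuracy for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, f1, accuracy

# Toy labels (made up)
y_true_toy = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred_toy = [1, 0, 0, 0, 0, 1, 0, 1]
precision_toy, recall_toy, f1_toy, accuracy_toy = classification_metrics(y_true_toy, y_pred_toy)
print(precision_toy, recall_toy, f1_toy, accuracy_toy)
```

Because F1 is the harmonic mean of precision and recall, it always lies between the two. An F1 score below the precision therefore means recall is lower than precision.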


In [None]:
# Add plots

# **Task 2: Feature generation and traditional ML model**

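Now that the basics are covered, we'll test a more complex and memory-intensive approach. The text is first processed through a Term Frequency - Inverse Document Frequency (tf-idf) vectorizer, which weights each term by how often it appears in a tweet and down-weights terms that appear in many tweets. Both unigrams and bigrams are included, which increases the number of features considerably.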
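
To illustrate the weighting, here is a small from-scratch sketch of the textbook tf-idf formula (term count times `log(N / df)`). It is an assumption-laden illustration, not the `TfidfVectorizer` implementation: scikit-learn uses a smoothed idf and L2-normalizes each row by default. The helper `tf_idf` and the example sentences are made up.

```python
import math

def tf_idf(docs):
    """Textbook tf-idf: term count in a document times log(N / document frequency)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    vocab = sorted({w for tokens in tokenized for w in tokens})
    # Document frequency: number of documents that contain each word
    df = {w: sum(w in tokens for tokens in tokenized) for w in vocab}
    weights = []
    for tokens in tokenized:
        # One dict per document, mapping each word to its tf-idf weight
        weights.append({w: tokens.count(w) * math.log(n_docs / df[w]) for w in set(tokens)})
    return weights

# Hypothetical example sentences
weights = tf_idf(["fire in the forest", "fire in the kitchen", "traffic in the city"])
for word, weight in sorted(weights[0].items()):
    print(word, round(weight, 3))
```

Words that occur in every document ("in", "the") receive a weight of zero, while a word unique to one document ("forest") is weighted more heavily than one shared by two documents ("fire").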

In [50]:
# instantiate the vectorizer
tfidf = TfidfVectorizer(ngram_range=(1,2), stop_words='english')

# fit and transform
X_train_tfidf = tfidf.fit_transform(X_train['text'].apply(preprocess_text))
X_train_tfidf = pd.DataFrame(X_train_tfidf.toarray(), columns = tfidf.get_feature_names_out(), index=X_train.index)

# transform only (the vectorizer was fit on the training set)
X_val_tfidf = tfidf.transform(X_val['text'].apply(preprocess_text))
X_val_tfidf = pd.DataFrame(X_val_tfidf.toarray(), columns = tfidf.get_feature_names_out(), index=X_val.index)

### **Mining Other Predictors**

Twitter is used for a wide variety of purposes. It is used to disseminate official information to the populace, to debate and discuss rapidly evolving news stories, to create and share art, and for entertainment among friends. Across all of these use cases, the communication tends to have distinct stylistic choices. As an example, official information may be written more formally, with care taken to follow linguistic conventions, whereas informal communication among friends may be shorter and less conventional. It is along these lines of thought that the following predictors were created.


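*   Tweet length (number of characters in the Tweet)
*   Average word length (characters per whitespace-separated word)
*   Location (binary, whether a location was provided with the Tweet)
*   Hashtags (number of '#' characters per word, usually used to contribute to a current trend)
*   Mentions, uppercase letters, and symbols (each expressed as a rate)
*   Links and retweets (counts of 'http' and 'RT')
*   Keyword (binary, whether a keyword is present; it seems to be pulled directly from the Tweet)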
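
As a rough sketch of how stylistic predictors like these can be computed, the following derives a few of them for a single string. The helper `tweet_style_features` and the example text are hypothetical; the notebook itself computes the features column-wise with pandas.

```python
import re

def tweet_style_features(text):
    """Compute a few stylistic ratios for a single tweet (illustrative helper)."""
    length = len(text)                # number of characters, spaces included
    words = text.count(' ') + 1       # whitespace-separated word count
    return {
        'length': length,
        'avg_word_length': length / words,
        'hash_avg': text.count('#') / words,                   # hashtags per word
        'mention_avg': text.count('@') / words,                # mentions per word
        'upper_avg': len(re.findall(r'[A-Z]', text)) / words,  # capitals per word
        'link_count': text.count('http'),
    }

# Made-up example text
features = tweet_style_features("Flood warning in #Springfield http://example.com")
print(features)
```

Note that the average word length here divides the full character count (spaces included) by the word count, so it slightly overstates the true average.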




In [51]:
X_train_tfidf['keyword_binary'] = np.where(X_train['keyword'].isnull(), 0, 1)
X_val_tfidf['keyword_binary'] = np.where(X_val['keyword'].isnull(), 0, 1)

X_train_tfidf['location_binary'] = np.where(X_train['location'].isnull(), 0, 1)
X_val_tfidf['location_binary'] = np.where(X_val['location'].isnull(), 0, 1)

from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()

X_train_temp = X_train.copy()
X_val_temp = X_val.copy()

X_train_temp['length'] = X_train['text'].str.len()
X_val_temp['length'] = X_val['text'].str.len()

X_train_temp['words'] = X_train['text'].str.count(' ') + 1
X_val_temp['words'] = X_val['text'].str.count(' ') + 1

X_train_temp['avg_word_length'] = X_train_temp['length']/X_train_temp['words'] 
X_val_temp['avg_word_length'] = X_val_temp['length']/X_val_temp['words'] 

X_train_temp['hash_avg'] = X_train['text'].str.count('#')/X_train_temp['words'] 
X_val_temp['hash_avg'] = X_val['text'].str.count('#')/X_val_temp['words']

X_train_temp['mention_avg'] = X_train['text'].str.count('@')/X_train_temp['words'] 
X_val_temp['mention_avg'] = X_val['text'].str.count('@')/X_val_temp['words']

X_train_temp['upper_avg'] = X_train['text'].str.findall(r'[A-Z]').str.len()/X_train_temp['words'] 
X_val_temp['upper_avg'] = X_val['text'].str.findall(r'[A-Z]').str.len()/X_val_temp['words']

X_train_temp['symbol_avg'] = X_train['text'].str.count(r'[^a-zA-Z0-9 ]')/X_train_temp['length'] 
X_val_temp['symbol_avg'] = X_val['text'].str.count(r'[^a-zA-Z0-9 ]')/X_val_temp['length']

X_train_temp['link_count'] = X_train['text'].str.count('http')
X_val_temp['link_count'] = X_val['text'].str.count('http')

X_train_temp['rt'] = X_train['text'].str.count('RT')
X_val_temp['rt'] = X_val['text'].str.count('RT')

new_predictors = ['length', 'avg_word_length', 'hash_avg', 'mention_avg', 'upper_avg', 'symbol_avg', 'link_count', 'rt']

X_train_temp_scaled = std_scaler.fit_transform(X_train_temp[new_predictors].to_numpy())
X_train_temp_scaled = pd.DataFrame(X_train_temp_scaled, 
                                   columns = new_predictors,
                                   index = X_train_tfidf.index)
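# Reuse the scaler fitted on the training set; refitting on validation data would scale the two sets inconsistently
X_val_temp_scaled = std_scaler.transform(X_val_temp[new_predictors].to_numpy())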
X_val_temp_scaled = pd.DataFrame(X_val_temp_scaled, 
                                 columns = new_predictors,
                                 index = X_val_tfidf.index)

X_train_tfidf = pd.concat([X_train_tfidf, X_train_temp_scaled], axis = 1)
X_val_tfidf = pd.concat([X_val_tfidf, X_val_temp_scaled], axis = 1)


### **Training and evaluating traditional ML model**
Now that the text has been processed using tf-idf, and a handful of other features have been derived using our knowledge of Tweets and disasters, we train a traditional ML model on the new feature set. Remember, the goal is to determine whether a given Tweet is referring to a real disaster or not - this is a classification problem. Therefore, our choice of learning algorithm is restricted to classifiers.

In [52]:
# Build and train traditional ML model
clf = LinearSVC(random_state=33, max_iter = 100000)
clf.fit(X_train_tfidf, y_train)

# Classify observations in the validation set
y_pred = clf.predict(X_val_tfidf)

# Calculate performance metrics
precision = precision_score(y_val, y_pred)
fscore = f1_score(y_val, y_pred)
accuracy = accuracy_score(y_val, y_pred)
print(precision, fscore, accuracy)

In [None]:
# Add plots

# **Task 3: Pre-trained word embeddings + linear classifier model**

A pre-trained word embedding is .... For this problem, I chose the BERT .... BERT is a leading-edge .... I chose a sentence embedding because we need a single embedding that represents a sequence of words (a Tweet). An efficient way to do this is to treat the Tweet as a sentence and use a sentence embedder. After running the training and validation sets through the pre-trained sentence embedder, each Tweet is represented by 768 variables.
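
One common way to turn per-token vectors into a single sentence vector is mean pooling, which the model name `bert-base-nli-mean-tokens` used below appears to refer to. The following is a toy sketch of that averaging step only, with made-up 3-dimensional vectors. The helper `mean_pool` is hypothetical and is not part of `sentence_transformers`.

```python
def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one sentence vector."""
    n_tokens = len(token_vectors)
    dim = len(token_vectors[0])
    # Average each dimension across all tokens
    return [sum(vec[i] for vec in token_vectors) / n_tokens for i in range(dim)]

# Toy token vectors (made up); a BERT-base model would produce 768 dimensions per token
tokens = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
sentence_vector = mean_pool(tokens)
print(sentence_vector)  # [2.0, 1.0, 1.0]
```

The output has the same dimensionality regardless of how many tokens the Tweet contains, which is what allows a linear classifier to be trained on it.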

In [None]:
# Load pre-trained embedding (using sentence because we want a single embedding which represents the sequence of words)
sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')

# Encoding train, validation Tweets with pre-trained embedding
sentence_embeddings = sbert_model.encode(X_train['text'].apply(preprocess_text).tolist())
val_embeddings = sbert_model.encode(X_val['text'].apply(preprocess_text).tolist())

In [None]:
# Building, fitting SGD Classifier to newly created training set
clf = make_pipeline(StandardScaler(), SGDClassifier(loss='hinge', max_iter=10000, random_state=33))
clf.fit(sentence_embeddings, y_train)

# Classifying validation set with newly trained model
y_pred = clf.predict(val_embeddings)

# Calculating performance metrics
precision = precision_score(y_val, y_pred)
fscore = f1_score(y_val, y_pred)
accuracy = accuracy_score(y_val, y_pred)
print(precision, fscore, accuracy)

In [None]:
# create the figure
fig = plt.figure(figsize=(20, 20))

# adjust the height of the padding between subplots to avoid overlapping
plt.subplots_adjust(hspace=0.3)

# add a centered suptitle to the figure
plt.suptitle("Difference in Features, Disaster vs. Non-disaster", fontsize=20, y=0.91)

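# derive the keyword indicator here, since 'keyword_binary' is not a column of train
keyword_binary = train['keyword'].notnull().astype(int)

# add the first subplot and compare the indicator between the two classes
ax = plt.subplot(4, 3, 1)
keyword_binary[train['target'] == 0].hist(ax=ax, alpha=0.5, label='Non-disaster', bins=40, color='royalblue', density=True)
keyword_binary[train['target'] == 1].hist(ax=ax, alpha=0.5, label='Disaster', bins=40, color='lightcoral', density=True)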

# set x_label, y_label, and legend
ax.set_xlabel('keyword', fontsize=14)
ax.set_ylabel('Probability Density', fontsize=14)
ax.legend(loc='upper right', fontsize=14)
    
plt.show()

# **Task 4: Recommendations to the clients**

Create one or more final plots of the relevant performance metrics from each experiment.

Your job is to present this to each client and provide a recommendation, taking into consideration all of the clients' wants and needs.

Explain your decisions.