
In [124]:
import pandas as pd
import numpy as np
import re, string
import matplotlib.pyplot as plt
import nltk
from sklearn.metrics import accuracy_score, precision_score, f1_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from nltk.stem import PorterStemmer

In [9]:
train = pd.read_csv('train.csv')

# **Exploring the Dataset**

Apart from the target, which is binary (disaster or not), all other relevant variables are strings. The main variable is the tweet itself, stored in the 'text' column. There is also a 'keyword' column (available for all but 0.8% of tweets) and a 'location' column (available for 66.7% of tweets).

In [10]:
train.isnull().mean(axis = 0)

id          0.000000
keyword     0.008013
location    0.332720
text        0.000000
target      0.000000
dtype: float64

# **Pre-processing Ideas**

It is important to understand the context of the text being analyzed before deciding on the pre-processing steps. This analysis involves public, short-form communication on the popular social media website Twitter, in the form of Tweets. Some textual patterns are particular to Twitter, such as:


*   Mentions (@username)
*   Hashtags (#subject)
*   Retweets 
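
As a rough illustration of how these patterns might be detected, the sketch below counts mentions and hashtags and flags old-style "RT @user" retweets. The regular expressions and the `twitter_markers` helper are assumptions made for this example, not part of the original analysis, and real Twitter handle and hashtag rules are more involved.

```python
import re

# Illustrative patterns only; they approximate Twitter's conventions.
MENTION = re.compile(r'@\w+')
HASHTAG = re.compile(r'#\w+')
RETWEET = re.compile(r'^RT\s+@\w+')

def twitter_markers(text):
    """Count mentions and hashtags and flag tweets that start with 'RT @user'."""
    return {
        'mentions': len(MENTION.findall(text)),
        'hashtags': len(HASHTAG.findall(text)),
        'is_retweet': bool(RETWEET.match(text)),
    }

example = twitter_markers('RT @nws: Flood warning for #Houston and #Galveston')
print(example)
```

Whether these markers should be stripped or kept as signals is a modelling choice; the preprocessing function used later removes mentions, for instance.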



In [101]:
X = train.drop(['target'], axis=1)
y = train['target']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=33)
ps = PorterStemmer()

# **Task 1: Bag of words model**

The first model, which is also the simplest, is built with the text processing method called Bag-of-Words, where the number of occurrences of a given word in each tweet is used as a feature.

In [134]:
# Preprocess raw tweet text for the bag-of-words model
def preprocess_text(text):
    # remove HTML entities such as &amp;
    text = re.sub(r'&[a-zA-Z]+;?', '', text)
    # remove mentions (@username)
    text = re.sub(r'@[a-zA-Z_]+;?', '', text)
    # disabled: this pattern is not a valid regex (\x must be followed by two hex digits)
    #text = re.sub(r'\x[a-zA-Z_]+;?', '', text)
    # remove any token that contains a digit
    text = re.sub(r'\w*\d+\w*', '', text)
    # make all text lowercase
    text = text.lower()
    # note: no stemming is applied; the PorterStemmer instance `ps` is unused
    return text


In [135]:
vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 1))

X_train_vec = vectorizer.fit_transform(X_train['text'].apply(preprocess_text))
X_train_vec = pd.DataFrame(X_train_vec.toarray(), columns=vectorizer.get_feature_names_out(), index=X_train.index)

X_val_vec = vectorizer.transform(X_val['text'].apply(preprocess_text))
X_val_vec = pd.DataFrame(X_val_vec.toarray(), columns=vectorizer.get_feature_names_out(), index=X_val.index)

In [136]:
# Fit a logistic regression classifier on the bag-of-words counts
lr = LogisticRegression(random_state=33)
lr.fit(X_train_vec, y_train)
y_pred = lr.predict(X_val_vec)

# Score the validation predictions: precision, F1, and accuracy
precision = precision_score(y_val, y_pred)
fscore = f1_score(y_val, y_pred)
accuracy = accuracy_score(y_val, y_pred)

In [137]:
print(precision, fscore, accuracy)


0.8069164265129684 0.7301173402868317 0.782563025210084


# **Task 2: Feature generation and traditional ML model**

Now that the basics are covered, we'll test a more complex and memory-intensive approach. The text is first processed through a Term Frequency-Inverse Document Frequency (tf-idf) vectorizer, here using both unigrams and bigrams.

In [16]:
# instantiate the vectorizer
tfidf = TfidfVectorizer(ngram_range=(1,2), stop_words='english')

# fit and transform
X_train_tfidf = tfidf.fit_transform(X_train['text'].apply(preprocess_text))
X_train_tfidf = pd.DataFrame(X_train_tfidf.toarray(), columns = tfidf.get_feature_names_out(), index=X_train.index)

# transform only (the vectorizer was already fit on the training set)
X_val_tfidf = tfidf.transform(X_val['text'].apply(preprocess_text))
X_val_tfidf = pd.DataFrame(X_val_tfidf.toarray(), columns = tfidf.get_feature_names_out(), index=X_val.index)

### **Mining Other Predictors**

Twitter is used for a wide variety of purposes. It is used to disseminate official information to the populace, to debate and discuss rapidly evolving news stories, to create and share art, and for entertainment among friends. Across all of these use cases, the communication tends to have distinct stylistic choices. As an example, official information may be written more formally, with care taken to follow linguistic conventions, whereas informal communication among friends may be shorter and less conventional. It is along these lines of thought that the following predictors were created.


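*   Tweet length (number of characters in Tweet)
*   Average word length (characters per whitespace-separated word)
*   Location (binary, whether a location is provided with the Tweet)
*   Hashtags (number of '#' per word, usually used to contribute to a current trend)
*   Keyword (binary, whether a keyword is present; it seems to be pulled directly from the tweet)
*   Mentions, uppercase letters, symbols, and links (rates per word or per character)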
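
One way predictors like those listed above might be computed is with a single helper applied to both the training and validation text, sketched here on two made-up tweets. The `style_features` function, the `toy_` names, and the `feat_` column prefix are all assumptions for this example; the prefix is there because a plain name such as `words` or `length` could coincide with a token in the tf-idf vocabulary and overwrite that column.

```python
import pandas as pd

def style_features(text):
    """Return per-tweet stylistic features as a new DataFrame (hypothetical helper)."""
    n_chars = text.str.len()
    # whitespace-based word count, used as a rough proxy
    n_words = text.str.count(' ') + 1
    return pd.DataFrame({
        'feat_length': n_chars,
        'feat_avg_word_length': n_chars / n_words,
        'feat_hash_avg': text.str.count('#') / n_words,
        'feat_mention_avg': text.str.count('@') / n_words,
    }, index=text.index)

# Hypothetical toy input, not taken from train.csv
toy_text = pd.Series(['Flood in #Houston now', '@bob lol this movie is a disaster'])
toy_features = style_features(toy_text)
print(toy_features)
```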




In [17]:
X_train_tfidf['keyword_binary'] = np.where(X_train['keyword'].isnull(), 0, 1)
X_val_tfidf['keyword_binary'] = np.where(X_val['keyword'].isnull(), 0, 1)

X_train_tfidf['location_binary'] = np.where(X_train['location'].isnull(), 0, 1)
X_val_tfidf['location_binary'] = np.where(X_val['location'].isnull(), 0, 1)

X_train_tfidf['length'] = X_train['text'].str.len()
X_val_tfidf['length'] = X_val['text'].str.len()

X_train_tfidf['words'] = X_train['text'].str.count(' ') + 1
X_val_tfidf['words'] = X_val['text'].str.count(' ') + 1

X_train_tfidf['avg_word_length'] = X_train_tfidf['length']/X_train_tfidf['words'] 
X_val_tfidf['avg_word_length'] = X_val_tfidf['length']/X_val_tfidf['words'] 

X_train_tfidf['hash_avg'] = X_train['text'].str.count('#')/X_train_tfidf['words'] 
X_val_tfidf['hash_avg'] = X_val['text'].str.count('#')/X_val_tfidf['words']

X_train_tfidf['mention_avg'] = X_train['text'].str.count('@')/X_train_tfidf['words'] 
X_val_tfidf['mention_avg'] = X_val['text'].str.count('@')/X_val_tfidf['words']

X_train_tfidf['upper_avg'] = X_train['text'].str.findall(r'[A-Z]').str.len()/X_train_tfidf['words'] 
X_val_tfidf['upper_avg'] = X_val['text'].str.findall(r'[A-Z]').str.len()/X_val_tfidf['words']

X_train_tfidf['symbol_avg'] = X_train['text'].str.count(r'[^a-zA-Z0-9 ]')/X_train_tfidf['length'] 
X_val_tfidf['symbol_avg'] = X_val['text'].str.count(r'[^a-zA-Z0-9 ]')/X_val_tfidf['length']

X_train_tfidf['http_avg'] = X_train['text'].str.count('http')/X_train_tfidf['words'] 
X_val_tfidf['http_avg'] = X_val['text'].str.count('http')/X_val_tfidf['words']

In [21]:
X_train_tfidf.iloc[:, 45885:45898].mean()

ûótech               0.000048
ûótech business      0.000048
ûówe                 0.000040
ûówe work            0.000040
keyword_binary       0.991943
location_binary      0.662112
length             100.844806
avg_word_length      7.098658
hash_avg             0.035803
mention_avg          0.029172
upper_avg            0.742539
symbol_avg           0.071311
http_avg             0.050235
dtype: float64

### **Training and evaluating traditional ML model**
Now that the text has been processed using tf-idf, and a handful of other features have been derived using our knowledge of Tweets and disasters, we train a traditional ML model on the new feature set. Remember, the goal is to determine whether a given Tweet is referring to a real disaster or not - this is a classification problem. Therefore, our choice of learning algorithm is restricted to classifiers.

In [18]:
# Build and train a linear support vector classifier
clf = LinearSVC(random_state=33)
svc_model = clf.fit(X_train_tfidf, y_train)

# Classify observations in the validation set
y_pred = svc_model.predict(X_val_tfidf)

# Calculate performance metrics
precision = precision_score(y_val, y_pred)
fscore = f1_score(y_val, y_pred)
accuracy = accuracy_score(y_val, y_pred)



In [19]:
print(precision, fscore, accuracy)

0.9583333333333334 0.10360360360360361 0.5819327731092437
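
The gap between precision and F1 in this output is easier to read once recall is recovered. The short calculation below derives it from the two printed values using the F1 definition; the variable names are chosen for this example.

```python
# Metrics printed above for the LinearSVC model
svc_precision = 0.9583333333333334
svc_f1 = 0.10360360360360361

# F1 = 2PR / (P + R)  =>  R = F1 * P / (2P - F1)
implied_recall = svc_f1 * svc_precision / (2 * svc_precision - svc_f1)
print(round(implied_recall, 4))
```

A recall of roughly 5% means the classifier labels almost every validation tweet as non-disaster, so the high precision rests on very few positive predictions. The already imported `recall_score` would report this value directly.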


In [None]:


# binary keyword indicator, computed here because `train` has no such column
keyword_binary = train['keyword'].notnull().astype(int)

# create the figure
fig = plt.figure(figsize=(20, 20))

# adjust the height of the padding between subplots to avoid overlapping
plt.subplots_adjust(hspace=0.3)

# add a centered suptitle to the figure
plt.suptitle("Difference in Features, Disaster vs. Non-disaster", fontsize=20, y=0.91)

# first panel of a 4x3 grid: keyword indicator, split by target class
ax = plt.subplot(4, 3, 1)
keyword_binary[train['target']==0].hist(ax=ax, alpha=0.5, label='Non-disaster', bins=40, color='royalblue', density=True)
keyword_binary[train['target']==1].hist(ax=ax, alpha=0.5, label='Disaster', bins=40, color='lightcoral', density=True)

# set x_label, y_label, and legend
ax.set_xlabel('keyword', fontsize=14)
ax.set_ylabel('Probability Density', fontsize=14)
ax.legend(loc='upper right', fontsize=14)

plt.show()

In [3]:
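import numpy as np
import gensim.downloader

# load 25-dimensional GloVe vectors trained on Twitter data
glove_vectors = gensim.downloader.load('glove-twitter-25')

# gensim >= 4.0 exposes the vocabulary as key_to_index (the older .vocab attribute was removed)
vocab = glove_vectors.key_to_index
sentence = ["london fog", "is", "the", "capital", "of", "great", "britain"]
vectors = []
for w in sentence:
    if w in vocab:
        vectors.append(glove_vectors[w])
    else:
        # out-of-vocabulary token: report it and fall back to a zero vector of matching size
        print("Word {} not in vocab".format(w))
        vectors.append(np.zeros(glove_vectors.vector_size))

print(vectors)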

#!pip install sentence_transformers
#from sentence_transformers import SentenceTransformer
#sbert_model = SentenceTransformer('bert-base-nli-mean-tokens')

#sentence_embeddings = sbert_model.encode(X_train['text'].tolist())