## Question 1 (Binary Classification)

How many training and test data points are there?
- training data (7613,5) -> 7613 data points
- test data (3263,4) -> 3263 data points

What percentage of the training tweets are of real disasters, and what percentage is not?
About 43% are real disasters and 57% are not real disasters.

In [5]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

### Load Dataset

##### Learn shape and number of data points for train data

In [6]:
disaster_pd = pd.read_csv("train.csv")
disaster_pd.shape

(7613, 5)

##### Find percentage of real and not real disasters

In [7]:
disaster_pd['target'].isnull().sum() ### No null values

0

In [8]:
real_disasters = disaster_pd['target'].sum() ### Number of real disasters

In [9]:
real_disasters / len(disaster_pd) ### Fraction of real disasters (7613 rows)

0.4296597924602653
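
The same proportions can be read off without hard-coding the row count. A minimal sketch on a made-up label list (the `labels` values are illustrative, not the competition data):

```python
from collections import Counter

# Hypothetical 0/1 labels standing in for disaster_pd['target'].
labels = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1]

counts = Counter(labels)
total = len(labels)

# Share of each class as a fraction of all rows.
shares = {label: count / total for label, count in counts.items()}

print(f"disaster: {shares[1]:.1%}, not disaster: {shares[0]:.1%}")
```

With pandas, `disaster_pd['target'].value_counts(normalize=True)` should give the equivalent result in one call.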

### Split the training data

##### Separate target data from training set

In [10]:
y = disaster_pd['target']

In [11]:
X = disaster_pd.drop(columns = ['target'])

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

In [13]:
X_train

Unnamed: 0,id,keyword,location,text
1186,1707,bridge%20collapse,,Ashes 2015: AustraliaÛªs collapse at Trent Br...
4071,5789,hail,"Carol Stream, Illinois",GREAT MICHIGAN TECHNIQUE CAMP\nB1G THANKS TO @...
5461,7789,police,Houston,CNN: Tennessee movie theater shooting suspect ...
5787,8257,rioting,,Still rioting in a couple of hours left until ...
7445,10656,wounds,Lake Highlands,Crack in the path where I wiped out this morni...
...,...,...,...,...
5226,7470,obliteration,Merica!,@Eganator2000 There aren't many Obliteration s...
5390,7691,panic,,just had a panic attack bc I don't have enough...
860,1242,blood,,Omron HEM-712C Automatic Blood Pressure Monito...
7603,10862,,,Officials say a quarantine is in place at an A...


### Preprocess Data

#### Convert all the words to lowercase

In [14]:
X_train['text'] = X_train['text'].str.lower()
X_train['location'] = X_train['location'].str.lower()

In [15]:
X_train

Unnamed: 0,id,keyword,location,text
1186,1707,bridge%20collapse,,ashes 2015: australiaûªs collapse at trent br...
4071,5789,hail,"carol stream, illinois",great michigan technique camp\nb1g thanks to @...
5461,7789,police,houston,cnn: tennessee movie theater shooting suspect ...
5787,8257,rioting,,still rioting in a couple of hours left until ...
7445,10656,wounds,lake highlands,crack in the path where i wiped out this morni...
...,...,...,...,...
5226,7470,obliteration,merica!,@eganator2000 there aren't many obliteration s...
5390,7691,panic,,just had a panic attack bc i don't have enough...
860,1242,blood,,omron hem-712c automatic blood pressure monito...
7603,10862,,,officials say a quarantine is in place at an a...


In [16]:
X_test['text'] = X_test['text'].str.lower()
X_test['location'] = X_test['location'].str.lower()

#### Stem all the words (Porter stemmer)

In [17]:
#X_train['text'] = X_train['text'].apply(lemmatize_func)

In [18]:
#X_train.head()

In [19]:
from nltk.stem import PorterStemmer
porter = PorterStemmer()

def lemmatize_func(text):
    # Note: despite the name, this applies Porter stemming, not lemmatization.
    words = text.split()
    lemmatize_words = [porter.stem(word) for word in words]
    return ' '.join(lemmatize_words) ### rejoin the stemmed words into one string

In [20]:
X_train['text'] = X_train['text'].apply(lemmatize_func)

In [21]:
X_train.head()

Unnamed: 0,id,keyword,location,text
1186,1707,bridge%20collapse,,ash 2015: australiaûª collaps at trent bridg ...
4071,5789,hail,"carol stream, illinois",great michigan techniqu camp b1g thank to @bmu...
5461,7789,police,houston,cnn: tennesse movi theater shoot suspect kill ...
5787,8257,rioting,,still riot in a coupl of hour left until i hav...
7445,10656,wounds,lake highlands,crack in the path where i wipe out thi morn du...


In [22]:
X_test['text'] = X_test['text'].apply(lemmatize_func)

In [23]:
X_test.head()

Unnamed: 0,id,keyword,location,text
2644,3796,destruction,,so you have a new weapon that can caus un-imag...
2227,3185,deluge,,the f$&amp;@ thing i do for #gishwh just got s...
5448,7769,police,uk,dt @georgegalloway: rt @galloway4mayor: ûïthe...
132,191,aftershock,,aftershock back to school kick off wa great. i...
6845,9810,trauma,"montgomery county, md",in respons to trauma children of addict develo...


#### Strip punctuation

Credit: https://www.geeksforgeeks.org/python-remove-punctuation-from-string/

In [24]:
import string

def strip_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))


In [25]:
X_train['text'] = X_train['text'].apply(strip_punctuation)
X_test['text'] = X_test['text'].apply(strip_punctuation)

In [26]:
X_train

Unnamed: 0,id,keyword,location,text
1186,1707,bridge%20collapse,,ash 2015 australiaûª collaps at trent bridg a...
4071,5789,hail,"carol stream, illinois",great michigan techniqu camp b1g thank to bmur...
5461,7789,police,houston,cnn tennesse movi theater shoot suspect kill b...
5787,8257,rioting,,still riot in a coupl of hour left until i hav...
7445,10656,wounds,lake highlands,crack in the path where i wipe out thi morn du...
...,...,...,...,...
5226,7470,obliteration,merica!,eganator2000 there arent mani obliter server b...
5390,7691,panic,,just had a panic attack bc i dont have enough ...
860,1242,blood,,omron hem712c automat blood pressur monitor st...
7603,10862,,,offici say a quarantin is in place at an alaba...


#### Strip stop words

Reference: https://www.geeksforgeeks.org/removing-stop-words-nltk-python/

In [27]:
### Download stop words from nltk
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')
stop_words = stopwords.words('english')

def stop_words_removal(df, column):
    return df[column].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\risha\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [28]:
X_train['text'] = stop_words_removal(X_train, 'text')
X_test['text'] = stop_words_removal(X_test, 'text')

In [29]:
X_train

Unnamed: 0,id,keyword,location,text
1186,1707,bridge%20collapse,,ash 2015 australiaûª collaps trent bridg amon...
4071,5789,hail,"carol stream, illinois",great michigan techniqu camp b1g thank bmurph1...
5461,7789,police,houston,cnn tennesse movi theater shoot suspect kill p...
5787,8257,rioting,,still riot coupl hour left class
7445,10656,wounds,lake highlands,crack path wipe thi morn dure beach run surfac...
...,...,...,...,...
5226,7470,obliteration,merica!,eganator2000 arent mani obliter server alway l...
5390,7691,panic,,panic attack bc dont enough money drug alcohol...
860,1242,blood,,omron hem712c automat blood pressur monitor st...
7603,10862,,,offici say quarantin place alabama home possib...


#### Strip @ mentions and URLs

Reference: https://github.com/sugatagh/Natural-Language-Processing-with-Disaster-Tweets/blob/main/natural-language-processing-with-disaster-tweets.ipynb

In [72]:
import re

def clean_text(text):
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'\b[a-zA-Z0-9]{10,}\b', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

X_train['text_clean'] = X_train['text'].apply(clean_text)
X_test['text_clean'] = X_test['text'].apply(clean_text)
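
To see what each pattern in `clean_text` removes, it may help to run it on a single string. The sample tweet below is made up for illustration. Note that the second pattern drops every alphanumeric token of ten or more characters, so it also removes long ordinary words (such as "evacuation"), not only user handles.

```python
import re

def clean_text(text):
    # Drop anything that starts with http/https/www up to the next whitespace.
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE)
    # Drop alphanumeric tokens of 10+ characters (handles, but also long words).
    text = re.sub(r'\b[a-zA-Z0-9]{10,}\b', '', text)
    # Collapse the leftover runs of whitespace.
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Hypothetical tweet, already lower-cased and stripped of punctuation.
sample = "eganator2000 evacuation order httptcoabc123 near lake"
cleaned = clean_text(sample)
print(cleaned)
```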
    

#### Extra Cleaning Steps

##### Remove Non-ASCII Characters

In [73]:
def remove_non_ascii(text):
    return re.sub(r'[^\x00-\x7F]+', '', text)

X_train['text_clean'] = X_train['text_clean'].apply(remove_non_ascii)
X_test['text_clean'] = X_test['text_clean'].apply(remove_non_ascii)

In [74]:
X_train

Unnamed: 0,id,keyword,location,text,text_clean
1186,1707,bridge%20collapse,,ash 2015 australiaûª collaps trent bridg amon...,ash 2015 australia collaps trent bridg among w...
4071,5789,hail,"carol stream, illinois",great michigan techniqu camp b1g thank bmurph1...,great michigan techniqu camp b1g thank termn8r...
5461,7789,police,houston,cnn tennesse movi theater shoot suspect kill p...,cnn tennesse movi theater shoot suspect kill p...
5787,8257,rioting,,still riot coupl hour left class,still riot coupl hour left class
7445,10656,wounds,lake highlands,crack path wipe thi morn dure beach run surfac...,crack path wipe thi morn dure beach run surfac...
...,...,...,...,...,...
5226,7470,obliteration,merica!,eganator2000 arent mani obliter server alway l...,arent mani obliter server alway like play
5390,7691,panic,,panic attack bc dont enough money drug alcohol...,panic attack bc dont enough money drug alcohol...
860,1242,blood,,omron hem712c automat blood pressur monitor st...,omron hem712c automat blood pressur monitor st...
7603,10862,,,offici say quarantin place alabama home possib...,offici say quarantin place alabama home possib...


##### Remove Numbers

In [75]:
X_train['text_clean'] = X_train['text_clean'].str.replace(r'\d+', '', regex=True)
X_test['text_clean'] = X_test['text_clean'].str.replace(r'\d+', '', regex=True)


In [76]:
X_train

Unnamed: 0,id,keyword,location,text,text_clean
1186,1707,bridge%20collapse,,ash 2015 australiaûª collaps trent bridg amon...,ash australia collaps trent bridg among worst...
4071,5789,hail,"carol stream, illinois",great michigan techniqu camp b1g thank bmurph1...,great michigan techniqu camp bg thank termnr g...
5461,7789,police,houston,cnn tennesse movi theater shoot suspect kill p...,cnn tennesse movi theater shoot suspect kill p...
5787,8257,rioting,,still riot coupl hour left class,still riot coupl hour left class
7445,10656,wounds,lake highlands,crack path wipe thi morn dure beach run surfac...,crack path wipe thi morn dure beach run surfac...
...,...,...,...,...,...
5226,7470,obliteration,merica!,eganator2000 arent mani obliter server alway l...,arent mani obliter server alway like play
5390,7691,panic,,panic attack bc dont enough money drug alcohol...,panic attack bc dont enough money drug alcohol...
860,1242,blood,,omron hem712c automat blood pressur monitor st...,omron hemc automat blood pressur monitor stand...
7603,10862,,,offici say quarantin place alabama home possib...,offici say quarantin place alabama home possib...


### Bag of words model

In [77]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer(binary=True, min_df = 5)
X_train_bag = count_vect.fit_transform(X_train['text_clean']) #array of words seen in each tweet
X_test_bag = count_vect.transform(X_test['text_clean'])

In [78]:
X_train_bag.shape

(5329, 1825)

In [79]:
X_test_bag.shape

(2284, 1825)

In [80]:
vocab_size = len(count_vect.get_feature_names_out())
print(f"Total number of features: {vocab_size}")

Total number of features: 1825
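
What `CountVectorizer(binary=True, min_df=5)` does can be approximated in a few lines of plain Python: keep only tokens that occur in at least `min_df` documents, then mark each document with 1 or 0 for presence. The sketch below uses whitespace tokenization and a tiny invented corpus with `min_df=2`. scikit-learn's default tokenizer behaves differently (for example, it ignores single-character tokens), so treat this as an illustration only.

```python
from collections import Counter

def binary_bag_of_words(docs, min_df):
    """Return (vocabulary, matrix) with 1/0 entries for word presence."""
    tokenized = [set(doc.split()) for doc in docs]
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(token for tokens in tokenized for token in tokens)
    vocab = sorted(token for token, df in doc_freq.items() if df >= min_df)
    matrix = [[1 if token in tokens else 0 for token in vocab]
              for tokens in tokenized]
    return vocab, matrix

# Hypothetical cleaned tweets.
docs = ["fire near lake", "fire evacu order", "lake day fun"]
vocab, matrix = binary_bag_of_words(docs, min_df=2)
print(vocab)
print(matrix)
```

Raising `min_df` shrinks the vocabulary, which is why only 1825 features remain here.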


### Logistic Regression

#### Train logistic model without L2 regularization

In [81]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# Note: scikit-learn 1.2+ expects penalty=None; the string 'none' is deprecated.
logreg = LogisticRegression(penalty='none', verbose=True, max_iter=1000)

logreg.fit(X_train_bag, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


LogisticRegression(max_iter=1000, penalty='none', verbose=True)

#### Predict on training dataset

In [82]:
y_train_pred = logreg.predict(X_train_bag)
f1_train = f1_score(y_train, y_train_pred)            

#### F1 score in training set

In [83]:
f1_train

0.9718928414580589

#### Predict on holdout set

In [84]:
y_test_pred = logreg.predict(X_test_bag)
f1_test = f1_score(y_test, y_test_pred)  

#### F1 score in development set

In [85]:
f1_test

0.6610837438423645

We believe this model is overfitting. The F1 score on the training set (about 0.97) is much higher than the F1 score on the holdout set (about 0.66), so the model fits the training tweets very well but generalizes poorly to unseen tweets.

#### Logistic Regression Model with L1 regularization

Reference: https://scikit-learn.org/stable/auto_examples/linear_model/plot_logistic_path.html

In [86]:
#logistic regression model with L1 regularization

log_reg_l1 = LogisticRegression(penalty='l1', solver='liblinear', max_iter=1000, C = 1)

log_reg_l1.fit(X_train_bag, y_train)

y_train_pred_l1 = log_reg_l1.predict(X_train_bag)

y_test_pred_l1 = log_reg_l1.predict(X_test_bag)

Note: Several values of C were tried to find a suitable regularization strength. C = 1 gave the best F1 score on the test data.
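
For reference, scikit-learn's documentation states the L1-penalized logistic regression objective in roughly the following form, with labels recoded as $y_i \in \{-1, +1\}$, weight vector $w$ and intercept $b$ (the notation here is ours):

$$
\min_{w,\,b}\; \lVert w \rVert_1 + C \sum_{i=1}^{n} \log\!\left(1 + \exp\!\left(-y_i\,(x_i^\top w + b)\right)\right)
$$

Because $C$ multiplies the data-fit term, a smaller $C$ means stronger regularization and a larger $C$ means weaker regularization.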

#### F1 score for training data

In [87]:
f1_train_l1 = f1_score(y_train, y_train_pred_l1)
f1_train_l1

0.8279694373697615

#### F1 score for test data

In [88]:
f1_test_l1 = f1_score(y_test, y_test_pred_l1)
f1_test_l1

0.7445414847161572

The logistic regression with L1 regularization performed much better on held-out data than the previous model without regularization. The training F1 score dropped (0.83 versus 0.97), while the test F1 score rose (0.74 versus 0.66). The scores on the training and development sets are now fairly close, although the model is still slightly overfitting the training set.

#### Logistic regression with L2 Regularizer

In [89]:
log_reg_l2 = LogisticRegression(penalty='l2', solver='liblinear', max_iter=1000)

In [90]:
log_reg_l2.fit(X_train_bag,y_train)

LogisticRegression(max_iter=1000, solver='liblinear')

In [91]:
y_train_pred_l2 = log_reg_l2.predict(X_train_bag)
y_test_pred_l2 = log_reg_l2.predict(X_test_bag)

#### F1 score for training

In [92]:
f1_train_l2 = f1_score(y_train, y_train_pred_l2)
f1_train_l2

0.8516129032258064

#### F1 score for test data

In [93]:
f1_test_l2 = f1_score(y_test, y_test_pred_l2)
f1_test_l2

0.7438558164937193

The logistic regression model with the L2 regularizer performed roughly the same as the L1 model. Its training F1 score was slightly higher, while its test F1 score was essentially identical (0.7439 versus 0.7445).

The logistic regression with no regularization overfit heavily. Its training F1 score of 0.97 was very high compared to its test F1 score of 0.66, which was the lowest of the three classifiers.

The L1 regularizer did better, with a training F1 score of 0.83 and a test F1 score of 0.74. The L2 regularizer had a slightly higher training F1 score of 0.85, but ultimately matched L1 on the test data with an F1 score of 0.74.

Note: The terms "test set" and "development set" are used interchangeably here. All of these models are evaluated on the development set, which is the 30% of the training data held out in the "Split the training data" step.
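
The split sizes reported earlier follow directly from `test_size = 0.3`. As far as we understand `train_test_split`, a fractional test size is rounded up and the remainder goes to training; the arithmetic below is a sketch under that assumption.

```python
import math

n_samples = 7613      # rows in train.csv
test_size = 0.3       # fraction held out as the development set

n_dev = math.ceil(test_size * n_samples)   # fractional size rounded up
n_train = n_samples - n_dev                # everything else is used for training

print(n_train, n_dev)
```

These values agree with the row counts of `X_train_bag` and `X_test_bag` shown in the bag of words section.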

#### Inspect weight vector with L1 regularization

Reference: https://scikit-learn.org/dev/modules/generated/sklearn.linear_model.LinearRegression.html

In [102]:
weights = log_reg_l1.coef_[0]

# Gather words from vectorizer (vocabulary list)
features = np.array(count_vect.get_feature_names_out())

# Sort by weights (needed help from ChatGPT for this step)
sort_pos = np.argsort(weights)[::-1]  # descending order
sort_neg = np.argsort(weights)  # ascending order


features_pos = features[sort_pos[:5]]
weights_pos = weights[sort_pos[:5]]


features_neg = features[sort_neg[:5]]
weights_neg = weights[sort_neg[:5]]

# Print the positive weights 
print("Top Positive Words:")
for word, weight in zip(features_pos, weights_pos):
    print(f"Word: {word}, Weight: {weight}")

# Print the negative weights 
print("\nTop Negative Words:")
for word, weight in zip(features_neg, weights_neg):
    print(f"Word: {word}, Weight: {weight}")

Top Positive Words:
Word: spill, Weight: 4.309336959881795
Word: hiroshima, Weight: 3.789952411469966
Word: outbreak, Weight: 3.631030479972402
Word: earthquak, Weight: 3.3409816274774964
Word: migrant, Weight: 3.228607970985584

Top Negative Words:
Word: ebay, Weight: -2.63399043757554
Word: attempt, Weight: -2.1657340670075484
Word: charact, Weight: -2.09519684195879
Word: hurt, Weight: -2.0222904803956805
Word: better, Weight: -1.8857716756476268


Inspecting the weight vector of the trained model, we printed the five most influential words in each direction. Words such as "spill" and "hiroshima" have large positive weights, so their presence pushes a tweet strongly toward the disaster class. Conversely, the negative weights on words such as "ebay" and "attempt" push tweets toward the "not disaster" class. The words in the lists above are typically associated with disaster or non-disaster scenarios, so the model behaves as expected.
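
Because the bag-of-words features are binary, each weight has a direct reading: the presence of a word multiplies the predicted odds of the disaster class by `exp(weight)`, holding the other words fixed. A small sketch using two of the weights printed above (rounded):

```python
import math

# Weights copied (rounded) from the L1 model output above.
weights = {"spill": 4.309, "ebay": -2.634}

# Multiplicative change in the odds of "disaster" when the word is present.
odds = {word: math.exp(w) for word, w in weights.items()}

for word, ratio in odds.items():
    print(f"{word}: odds multiplied by {ratio:.2f}")
```

A ratio above 1 pushes toward the disaster class and a ratio below 1 pushes away from it.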

### Bernoulli Naive Bayes

#### Compute the maximum likelihood model parameters

In [149]:
n = X_train_bag.shape[0] #n size of data  set
d = X_train_bag.shape[1] #d number of features (words)
k = 2 # number of classes

#shape of parameters
psis = np.zeros([k,d])
phis = np.zeros([k])

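# compute parameters for each class c (loop variable renamed so it no longer overwrites k)

alpha = 1 # Laplace smoothing constant
for c in range(k):
    X_k = X_train_bag[y_train == c]
    psis[c] = (X_k.sum(axis=0) + alpha) / (X_k.shape[0] + (2 * alpha)) #Laplace smoothing
    phis[c] = X_k.shape[0] / float(n)

# print the per-word probabilities (psis) and the class priors (phis)
print(psis, '\n'*2, phis)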

    


[[0.00231328 0.00099141 0.00066094 ... 0.00033047 0.00198282 0.00528751]
 [0.00130039 0.00303424 0.00866927 ... 0.00303424 0.00086693 0.00346771]] 

 [0.56746106 0.43253894]
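
Written out, the estimates computed in the loop are the following, where $n_k$ is the number of training tweets in class $k$, $n_{kj}$ is the number of those tweets containing word $j$, $n$ is the total number of training tweets, and $\alpha = 1$ is the smoothing constant (symbol names chosen here to mirror `psis` and `phis`):

$$
\psi_{kj} = \frac{n_{kj} + \alpha}{n_k + 2\alpha}, \qquad \phi_k = \frac{n_k}{n}
$$

The $2\alpha$ in the denominator reflects the two possible values (present or absent) of each Bernoulli feature.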


#### Compute predictions using Bayes' rule

No need to reshape as there are only 2 classes, making the problem much simpler

In [175]:
def nb_predictions(x, psis, phis): #returns class assignments
    n = x.shape[0]
    k = 2 # set on binary classification
    
    logpyx = np.zeros((n, k)) # creates a new array: n x k
    
    x = x.toarray() # need to convert sparse matrix to numpy array (error fixed by Chatgpt)
    
    # clip probabilities to avoid log(0)
    psis = psis.clip(1e-14, 1-1e-14)
    
    #compute log probabilities
    
    for c in range(k):
        logpy = np.log(phis[c]) # log prior probability of class c
        
        # log-likelihood of each tweet's words under class c
        logpxy = (x * np.log(psis[c])) + ((1 - x) * np.log(1 - psis[c]))
        
        logpyx[:, c] = logpy + np.sum(logpxy, axis=1) # log prior + summed log-likelihood
        
    return np.argmax(logpyx, axis=1)
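
The decision rule can be checked by hand on a toy problem. The sketch below re-implements the same log-posterior computation with plain lists; the two-word vocabulary and the parameter values are invented for illustration.

```python
import math

def nb_predict_one(x, psis, phis):
    """Return the class with the largest log posterior for one binary vector x."""
    scores = []
    for k in range(len(phis)):
        score = math.log(phis[k])  # log prior of class k
        for x_j, psi_kj in zip(x, psis[k]):
            # Bernoulli log-likelihood of feature j under class k.
            score += x_j * math.log(psi_kj) + (1 - x_j) * math.log(1 - psi_kj)
        scores.append(score)
    return scores.index(max(scores))

# Hypothetical parameters: rows are classes, columns are words ["fire", "ebay"].
psis = [[0.1, 0.6],   # class 0 (not disaster)
        [0.7, 0.1]]   # class 1 (disaster)
phis = [0.57, 0.43]

pred_fire = nb_predict_one([1, 0], psis, phis)  # tweet contains only "fire"
pred_ebay = nb_predict_one([0, 1], psis, phis)  # tweet contains only "ebay"
print(pred_fire, pred_ebay)
```

Working in log space, as the notebook function does, avoids underflow when many small probabilities are multiplied together.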

#### Create function to calculate f1 scores

In [179]:
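def f1_score(y_test, y_pred):
    # Note: this definition shadows sklearn.metrics.f1_score imported earlier.
    tp = np.sum((y_test == 1) & (y_pred == 1)) # True positive
    tn = np.sum((y_test == 0) & (y_pred == 0)) # True negative (not needed for F1)
    fp = np.sum((y_test == 0) & (y_pred == 1)) # False positive
    fn = np.sum((y_test == 1) & (y_pred == 0)) # False negative
    
    # Precision: guard against having no predicted positives
    if tp + fp > 0:
        p = tp / (tp + fp)
    else:
        p = 0
    
    # Recall: guard against having no actual positives
    if tp + fn > 0:
        r = tp / (tp + fn)
    else:
        r = 0
    
    # F1: guard against precision and recall both being zero
    if p + r == 0:
        return 0
    return 2 * (p * r) / (p + r)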
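
As a sanity check, F1 can also be written directly in terms of the confusion counts, F1 = 2·TP / (2·TP + FP + FN), which is algebraically the same as the harmonic mean of precision and recall. The counts below are invented for illustration.

```python
def f1_from_counts(tp, fp, fn):
    """F1 written directly in terms of confusion counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

# Hypothetical counts: 8 hits, 2 false alarms, 4 misses.
tp, fp, fn = 8, 2, 4

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_via_pr = 2 * precision * recall / (precision + recall)
f1_direct = f1_from_counts(tp, fp, fn)

print(round(f1_via_pr, 4), round(f1_direct, 4))
```

Both routes should agree up to floating-point rounding.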

#### Predict whether a tweet is in class (disaster, not disaster) on development set

In [180]:
y_pred = nb_predictions(X_test_bag, psis, phis)

#### Compute F1 score on development set

In [181]:
f1_score(y_test, y_pred)

0.7515563101301642

### Model Comparison

### N - gram model

### Determine Performance with the test set