# Project Code

## Introduction
As the world becomes more connected through online platforms, it is important to analyze trends as they occur in real time to understand their scope and features. This project analyzes the elements used in detecting disasters from Twitter data, using supervised machine learning to determine whether the event described is real or not.

## Data Extraction
The first step in the Data Analysis Process is data extraction. Here, the data will be extracted from a dataset and the different variables will be examined.
In this case, the data comes from a .csv dataset found on the kaggle.com website and has already been modified to contain a target outcome. To understand the structure of the data, it first needs to be imported into a pandas DataFrame object.

In [2]:
import pandas as pd # Import the pandas library as pd.
# Import the csv dataset file, and assign it to a DataFrame object called 'dataset' for further processing.
dataset = pd.read_csv("nlptrain.csv", sep = ",") # Use a comma as the delimiter inside of the dataset.
dataset # Print the contents of the dataset.

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1
...,...,...,...,...,...
7608,10869,,,Two giant cranes holding a bridge collapse int...,1
7609,10870,,,@aria_ahrary @TheTawniest The out of control w...,1
7610,10871,,,M1.94 [01:04 UTC]?5km S of Volcano Hawaii. htt...,1
7611,10872,,,Police investigating after an e-bike collided ...,1


Now that the data has been extracted and imported into a pandas DataFrame object, we can see that there are a total of five variables available: id, keyword, location, text, and target. The id column provides each record with a unique primary key/identifier. The keyword column shows a disaster-related keyword extracted from the text column. The location column describes where the tweet or disaster occurred, and the text column holds the tweet itself. The target variable is the dependent variable of this dataset and classifies whether the tweet describes a disaster or not: 1 represents true and 0 represents false.

Below, the datatypes of each variable are shown. This is important to understand because our features must be numeric, since we are building mathematical models.

In [3]:
dataset.dtypes

id           int64
keyword     object
location    object
text        object
target       int64
dtype: object
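
The first rows printed earlier show empty keyword and location fields, so it can help to count missing values per column before modeling. The following is a minimal sketch on a made-up three-row frame that only mirrors the column names of the dataset; the values and the name `toy` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical three-row frame that mirrors the column names of the dataset.
toy = pd.DataFrame({
    "id": [1, 4, 5],
    "keyword": [None, "wildfire", None],
    "location": [None, None, "Canada"],
    "text": ["tweet one", "tweet two", "tweet three"],
    "target": [1, 1, 0],
})

# Count the missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Count how often each class appears in the target column.
print(toy["target"].value_counts())
```

Running the same two calls on the real `dataset` object would show how many keyword and location entries are actually missing.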

## Data Preparation
The next step in the Data Analysis Process is to prepare the data. Here, we will need to transform the tweets into vectors of word frequency. After this stage, feature selection may be used, and in this case it will be used further below so that the results between the two methods can be compared.

To begin, the CountVectorizer class must be imported into the program from the sklearn library to use as an object.

In [4]:
from sklearn.feature_extraction.text import CountVectorizer # Import the CountVectorizer class for feature extraction.

# Create a CountVectorizer object.
vectorizer = CountVectorizer(stop_words="english")

# Learn the vocabulary of the 'text' column and transform each tweet into a vector of word counts (a sparse matrix).
X = vectorizer.fit_transform(dataset['text'])

# Create another DataFrame object from the extracted features as a dense matrix.
# Note: get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
df_tf = pd.DataFrame(X.todense(), columns=vectorizer.get_feature_names_out())

# Assign extracted features DataFrame object to the variable x and the target variable in the dataset to variable y.
x = df_tf
y = dataset["target"]

# Print the DataFrame object's contents containing the extracted features and their word frequency.
df_tf

Unnamed: 0,00,000,0000,007npen6lg,00cy9vxeff,00end,00pm,01,02,0215,...,ûò,ûò800000,ûòthe,ûòåêcnbc,ûó,ûóher,ûókody,ûónegligence,ûótech,ûówe
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7608,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7609,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7610,0,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
7611,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Splitting the Data
The next step in the Data Analysis Process is to split the data into separate training and testing data sets. In this case, the conventional approach is used first by relying on the train_test_split function found in the sklearn library. Here, the data is split into two pieces by passing in the x and y variables created above. 80% of the original dataset will be used for training data, and 20% will be used for testing data. The function returns four objects, which are assigned to train_x, test_x, train_y, and test_y.

In a later section, the cross validation method will be used to split and train the data.

In [5]:
from sklearn.model_selection import train_test_split # Import the train_test_split function from the sklearn library to split the data into two parts.
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.2, random_state=123) # Use 20% of the dataset as testing data; the fixed random_state makes the shuffled split reproducible.

Below, I printed the counts of the target variable inside each set to verify that the class proportions are roughly preserved by the split.

In [26]:
# Verify the target counts in the split datasets.
print("Total target counts")
print("\ntarget counts in training Y:\n",train_y.value_counts())
print("\ntarget counts testing Y:\n",test_y.value_counts())

Total target counts

target counts in training Y:
 0    3463
1    2627
Name: target, dtype: int64

target counts testing Y:
 0    879
1    644
Name: target, dtype: int64


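These counts indicate that the class proportions are similar in both sets (about 43% positive in the training set and about 42% in the testing set), even though the split was not explicitly stratified.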
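
If the class proportions need to match exactly rather than roughly, train_test_split accepts a `stratify` argument. The following is a minimal sketch on made-up labels; the arrays `toy_x` and `toy_y` are assumptions for illustration, and this option was not used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: 60 negatives and 40 positives.
toy_y = np.array([0] * 60 + [1] * 40)
toy_x = np.arange(100).reshape(-1, 1)  # placeholder feature column

# stratify=toy_y keeps the 60/40 class ratio in both resulting sets.
tr_x, te_x, tr_y, te_y = train_test_split(
    toy_x, toy_y, test_size=0.2, random_state=123, stratify=toy_y
)

print("train positive share:", tr_y.mean())
print("test positive share:", te_y.mean())
```

Both printed shares equal the overall share of positives in `toy_y`, which is what an explicitly stratified split guarantees.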

## Feature Selection Using the Chi-Squared Test
Although a bag of words was used to hold all of the extracted features above, it unfortunately contains a large number of unnecessary variables that can lead to biased outcomes and overfitted models. Thankfully, we can extract the significant features in the dataset using the Chi-Squared test. This method of feature selection requires discrete predictor and outcome variables, which are already included in our dataset.

To begin, the chi2() function needs to be imported from the sklearn library and applied using only the training data. It returns a chi-squared statistic and a p value for each feature. Here, the features in our bag of words with a p value of <0.05 are treated as significant and kept, although sometimes a stricter threshold of 0.01 may be used instead. A small p value means that the association between the feature and the target is unlikely to be due to chance alone.

In [21]:
# Import the chi2() function.
from sklearn.feature_selection import chi2

# Compute the chi-squared statistic and p value between each attribute and the target variable.
chi2_val, p_val = chi2(train_x, train_y)

# Create a new pandas DataFrame object with the newly computed scores.
df_scores = pd.DataFrame(zip(train_x.columns, chi2_val, p_val), columns=["feature", "chi2", "p"])
df_scores["chi2"] = df_scores["chi2"].round(2)
df_scores["p"] = df_scores["p"].round(3)

# Use features with p value < 0.05 (significant features).
sel_ohe_cols = df_scores[df_scores["p"]<0.05]["feature"].values
print("\nSelected features: %d" % len(sel_ohe_cols))
sel_ohe_cols


Selected features: 1104


array(['00', '01', '04', ..., 'zionist', 'zone', 'ûïwhen'], dtype=object)
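
To show what the chi-squared test measures, here is a minimal sketch on a made-up count matrix with three words. The words, counts, and variable names are assumptions for illustration: "fire" appears only in disaster rows, "lake" only in non-disaster rows, and "the" appears equally in both.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical word-count matrix: rows are documents, columns are words.
toy_words = ["fire", "lake", "the"]
toy_counts = np.array([
    [3, 0, 1],
    [2, 0, 1],
    [4, 0, 1],
    [0, 2, 1],
    [0, 3, 1],
    [0, 1, 1],
])
toy_target = np.array([1, 1, 1, 0, 0, 0])

# chi2 returns one chi-squared statistic and one p value per column.
chi2_stat, p_value = chi2(toy_counts, toy_target)
for word, stat, p in zip(toy_words, chi2_stat, p_value):
    print(f"{word}: chi2={stat:.2f}, p={p:.3f}")

# Keep only the words whose p value is below the 0.05 threshold.
selected = [w for w, p in zip(toy_words, p_value) if p < 0.05]
print("selected:", selected)
```

The word that is spread evenly across both classes carries no information about the target, so it receives a statistic of zero and is dropped by the threshold.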

## Building Different Predictive Models
### 13 Nearest Neighbors
The first type of predictive model that will be built uses the 13 nearest neighbors. This type of lazy, instance-based learning uses the 13 closest training instances to predict the outcome of the target variable. To begin, the KNeighborsClassifier class must be imported from the sklearn library. Then, an instance of this class must be created with the parameter n_neighbors=13 to specify the number of nearest neighbors to be used. Afterwards, the model is trained by fitting the training dataset variables, and then a predicted y variable is created using the testing data. Finally, the performance of the model is calculated by importing the accuracy_score, f1_score, precision_score, and recall_score functions from the sklearn library and passing them the "pred_y, test_y" values. Note that these functions expect the true labels first and the predictions second; because the predictions are passed first here, the printed precision and recall values are swapped relative to their usual definitions, while accuracy and f1 are unaffected.
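
The metric functions in sklearn take the true labels as the first argument and the predictions as the second. The following is a minimal sketch with made-up labels that shows the effect of reversing that order, as the metric calls in this notebook do; the lists `y_true` and `y_pred` are assumptions for illustration.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels: the model predicts the positive class only once.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0]

# Conventional order: (y_true, y_pred).
p_ok = precision_score(y_true, y_pred)  # share of predicted positives that are correct
r_ok = recall_score(y_true, y_pred)     # share of actual positives that are found
print("precision:", p_ok, "recall:", r_ok)

# Reversed order: precision and recall trade places.
p_rev = precision_score(y_pred, y_true)
r_rev = recall_score(y_pred, y_true)
print("precision (reversed):", p_rev, "recall (reversed):", r_rev)

# F1 is symmetric in its two arguments, so it does not change.
print("f1:", f1_score(y_true, y_pred), f1_score(y_pred, y_true))
```

When reading the results in the following sections, the value printed as recall is therefore the conventional precision, and the value printed as precision is the conventional recall.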


As one can see further below, the accuracy of this model is almost 65%.

In [8]:
from sklearn.neighbors import KNeighborsClassifier # Import the KNeighborsClassifier class.

knn = KNeighborsClassifier(n_neighbors=13) # Create an object from the class by calling its constructor method with the specified number of neighbors to use.

# Train the model by fitting the training dataset variables (x=independent, y=dependent).
knn = knn.fit(train_x, train_y)
# Predict the target for the test set.
knn_pred_y = knn.predict(test_x)

# Evaluate the prediction results by importing the required libraries to show the statistics.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Here we can evaluate the performance of the model and compare the results to other trained models.
print("Results without feature selection:")
print("f1:" + str(f1_score(knn_pred_y, test_y)))
print("accuracy:" + str(accuracy_score(knn_pred_y, test_y)))
print("precision:" + str(precision_score(knn_pred_y, test_y)))
print("recall:" + str(recall_score(knn_pred_y, test_y)))

Results without feature selection:
f1:0.2988204456094365
accuracy:0.6487196323046619
precision:0.17701863354037267
recall:0.957983193277311


In [9]:
knn_fs = KNeighborsClassifier(n_neighbors=13) # Create a new KNeighborsClassifier object by calling its constructor method.

# Train the model by fitting the training dataset variables (x=independent (using significant features), y=dependent).
knn_fs = knn_fs.fit(train_x[sel_ohe_cols], train_y)
# Predict the target for the test set (using selected significant features).
knn_fs_pred_y = knn_fs.predict(test_x[sel_ohe_cols])

# Evaluate the prediction results and show the statistics to evaluate the performance of the model and compare the results to other trained models.
print("Results with feature selection:")
print("f1:" + str(f1_score(knn_fs_pred_y, test_y)))
print("accuracy:" + str(accuracy_score(knn_fs_pred_y, test_y)))
print("precision:" + str(precision_score(knn_fs_pred_y, test_y)))
print("recall:" + str(recall_score(knn_fs_pred_y, test_y)))

Results with feature selection:
f1:0.4139590854392299
accuracy:0.680236375574524
precision:0.2670807453416149
recall:0.9197860962566845


Here, we can see the results that included feature selection. These statistics indicate that feature selection increased all of the printed values except recall.

### Decision Tree
The second type of predictive model that will be built uses a decision tree classifier. This type of model requires the outcome variable to be discrete. The sklearn implementation splits numeric attributes on thresholds, so the word counts in our dataset can be used directly without further discretization.

To begin, the tree module from the sklearn library is imported to access the DecisionTreeClassifier class, and then an instance of this class is created. Next, this object is fitted to the training variables to create a model. Finally, the model is evaluated by viewing its performance.

In [10]:
# Import the tree module from the sklearn library.
from sklearn import tree
dt = tree.DecisionTreeClassifier() # Create the DecisionTreeClassifier object by calling its constructor method.

# Train the newly created dt object by fitting it to the training features (x) and target (y).
dt = dt.fit(train_x, train_y)
# Predict the target for the test set.
dt_pred_y = dt.predict(test_x)

# Evaluate the prediction results and show the statistics to evaluate the performance of the model and compare the results to other trained models.
print("Results without feature selection:")
print("f1:" + str(f1_score(dt_pred_y, test_y)))
print("accuracy:" + str(accuracy_score(dt_pred_y, test_y)))
print("precision:" + str(precision_score(dt_pred_y, test_y)))
print("recall:" + str(recall_score(dt_pred_y, test_y)))

Results without feature selection:
f1:0.7049180327868854
accuracy:0.7636244254760342
precision:0.6677018633540373
recall:0.7465277777777778


In [11]:
dt_sf = tree.DecisionTreeClassifier() # Create a new DecisionTreeClassifier object by calling its constructor method.

# Train the dt_sf object by fitting it to the training data (using selected significant features).
dt_sf = dt_sf.fit(train_x[sel_ohe_cols], train_y)
# Predict the target for the test set (using selected significant features).
dt_sf_pred_y = dt_sf.predict(test_x[sel_ohe_cols])

# Evaluate the prediction results and show the statistics to evaluate the performance of the model and compare the results to other trained models.
print("Results with feature selection:")
print("f1:" + str(f1_score(dt_sf_pred_y, test_y)))
print("accuracy:" + str(accuracy_score(dt_sf_pred_y, test_y)))
print("precision:" + str(precision_score(dt_sf_pred_y, test_y)))
print("recall:" + str(recall_score(dt_sf_pred_y, test_y)))

Results with feature selection:
f1:0.6810344827586207
accuracy:0.7570584372948129
precision:0.6133540372670807
recall:0.7655038759689923


As we can see above, feature selection actually decreased the f1, accuracy, and precision of this model.

### Support Vector Machine
The third type of predictive model that will be built uses the Support Vector Machine. This type of modeling searches for the hyperplane that separates two classes with the maximum margin. It is known for handling high-dimensional data well, and kernelized versions can also learn non-linear boundaries through the kernel trick. In this case we will use a linear hyperplane via LinearSVC; the SVC class in sklearn also offers polynomial and other kernels to classify data points. In addition, these classifiers expose parameters that can be tuned to obtain better results.

To begin, the LinearSVC class must be imported from the sklearn library. Then, an instance of this class will be created with a fixed random state so that the results are reproducible. Next, the model is trained, and the performance of the model is evaluated and displayed.

In [12]:
# Import the LinearSVC class from the sklearn library.
from sklearn.svm import LinearSVC

# Create an object of the LinearSVC class with a fixed random state so that the results are reproducible.
svc = LinearSVC(random_state=123456)

# Train the svc object by fitting it to the training features (x) and target (y).
svc = svc.fit(train_x, train_y)
# Predict the target for the test set.
svc_pred_y = svc.predict(test_x)

# Evaluate the prediction results and show the statistics to evaluate the performance of the model and compare the results to other trained models.
print("Results without feature selection:")
print ("f1:" + str(f1_score(svc_pred_y, test_y)))
print ("accuracy:" + str(accuracy_score(svc_pred_y, test_y)))
print ("precision:" + str(precision_score(svc_pred_y, test_y)))
print ("recall:" + str(recall_score(svc_pred_y, test_y)))

Results without feature selection:
f1:0.7327731092436974
accuracy:0.7912015758371634
precision:0.6770186335403726
recall:0.7985347985347986


In [13]:
# Create a new object of the LinearSVC class with a fixed random state so that the results are reproducible.
svc_sf = LinearSVC(random_state=123456)

# Train the svc_sf object by fitting it to the training data (with selected significant features).
svc_sf = svc_sf.fit(train_x[sel_ohe_cols], train_y)
# Predict the target for the test set (using selected significant features).
svc_sf_pred_y = svc_sf.predict(test_x[sel_ohe_cols])

# Evaluate the prediction results and show the statistics to evaluate the performance of the model and compare the results to other trained models.
print("Results with feature selection:")
print ("f1:" + str(f1_score(svc_sf_pred_y, test_y)))
print ("accuracy:" + str(accuracy_score(svc_sf_pred_y, test_y)))
print ("precision:" + str(precision_score(svc_sf_pred_y, test_y)))
print ("recall:" + str(recall_score(svc_sf_pred_y, test_y)))

Results with feature selection:
f1:0.7261698440207972
accuracy:0.7925147734734077
precision:0.6506211180124224
recall:0.8215686274509804


In this case, both models have very similar accuracy and f1 scores. Here, the model with selected features appears to have a slightly higher accuracy and recall.

### Naive Bayes
The fourth type of predictive model that will be built uses the Naive Bayes modeling algorithm. This type of modeling uses Bayes' theorem and is naive because it assumes that the features are independent of each other given the class. As a result, it may return incorrect probability estimates if the independence assumption is violated, but this does not affect classification as long as the violation is evenly distributed across classes.

To begin, the MultinomialNB class must be imported from the sklearn library. Then, an instance of it must be created. Thereafter, the model is trained, and finally the performance of the model is evaluated and displayed.

In [14]:
# Import the MultinomialNB class from the sklearn.naive_bayes module.
from sklearn.naive_bayes import MultinomialNB

# Create an object of the MultinomialNB class.
nb = MultinomialNB()

# Train the nb object by fitting it to the training features (x) and target (y).
nb = nb.fit(train_x, train_y)
# Predict the target for the test set.
nb_pred_y = nb.predict(test_x)

# Evaluate the prediction results and show the statistics to evaluate the performance of the model and compare the results to other trained models.
print("Results without feature selection:")
print ("f1:" + str(f1_score(nb_pred_y, test_y)))
print ("accuracy:" + str(accuracy_score(nb_pred_y, test_y)))
print ("precision:" + str(precision_score(nb_pred_y, test_y)))
print ("recall:" + str(recall_score(nb_pred_y, test_y)))

Results without feature selection:
f1:0.7526358475263585
accuracy:0.7997373604727511
precision:0.7204968944099379
recall:0.7877758913412564


In [15]:
# Create a new object of the MultinomialNB class.
nb_sf = MultinomialNB()

# Train the nb_sf object by fitting it to the training data (with selected features).
nb_sf = nb_sf.fit(train_x[sel_ohe_cols], train_y)
# Predict the target for the test set (using selected features).
nb_sf_pred_y = nb_sf.predict(test_x[sel_ohe_cols])

# Evaluate the prediction results and show the statistics to evaluate the performance of the model and compare the results to other trained models.
print("Results with feature selection:")
print ("f1:" + str(f1_score(nb_sf_pred_y, test_y)))
print ("accuracy:" + str(accuracy_score(nb_sf_pred_y, test_y)))
print ("precision:" + str(precision_score(nb_sf_pred_y, test_y)))
print ("recall:" + str(recall_score(nb_sf_pred_y, test_y)))

Results with feature selection:
f1:0.7325581395348837
accuracy:0.788575180564675
precision:0.6847826086956522
recall:0.7875


Here, the results indicate that the model without feature selection appears to perform better. However, we also need to understand that this model may be overfitted due to the number of variables used in the bag of words.

## Cross Validation

Here, I also evaluate the model using the cross_val_predict function found in the sklearn library. In this instance, the data will be split into 5 partitions using the cv=5 parameter, together with the x and y variables and the KNeighborsClassifier object created earlier. Each partition is used once for testing while the other four are used for training, giving a ratio of 4 to 1. The benefit of this technique is a more robust performance estimate, because every record is predicted by a model that was not trained on it, which reduces the risk that the result depends on one particular split.
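
The partitioning described above can be sketched on a small made-up corpus. In this sketch the vectorizer and a chi-squared selector are placed inside a Pipeline so that each fold learns its vocabulary and selects its features from its own training partition only. This is an alternative arrangement and not the method used for the results in this notebook; the texts, labels, `k=10`, `n_neighbors=3`, and `cv=3` are all assumptions chosen to suit the tiny example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, not taken from the dataset.
toy_texts = [
    "forest fire spreading fast", "earthquake damage downtown",
    "flood warning issued", "wildfire evacuation ordered",
    "storm destroyed homes", "fire crews respond",
    "lovely sunny picnic", "great movie tonight",
    "new phone arrived", "coffee with friends",
    "happy birthday party", "weekend football match",
]
toy_labels = [1] * 6 + [0] * 6

# Every step is refit on the training partition of each fold.
model = make_pipeline(
    CountVectorizer(),
    SelectKBest(chi2, k=10),              # keep the 10 highest-scoring words
    KNeighborsClassifier(n_neighbors=3),  # small k because the corpus is tiny
)

# cv=3 gives three partitions; each text is predicted exactly once,
# by a model that did not see it during training.
toy_pred = cross_val_predict(model, toy_texts, toy_labels, cv=3)
print(toy_pred)
```

The returned array has one prediction per input text, so it can be passed to classification_report together with the true labels.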

In this case, this was only used with the 13 Nearest Neighbors classifier using feature selection, because this process took my computer over 2 hours to complete. As a result, I decided to leave this code here to prove that I have been able to apply the logic behind using these methods and focus more on demonstrating my learned skills by creating solid documentation for this project. Thank you for teaching the class about this wonderful technology. I greatly appreciate learning about all of this from you and your stress on the most important steps in the data analysis process.

### Cross Validation Using KNearestNeighbors and Feature Selection

In [20]:
from sklearn.model_selection import cross_val_predict  # Import the required cross_val_predict function.
y_pred = cross_val_predict(knn_fs, x, y, cv=5) # Split the data into 5 partitions and refit the KNeighborsClassifier on each training partition. Note: x as passed here is the full bag of words; x[sel_ohe_cols] would restrict it to the selected features.
y_pred # Display the cross-validated predictions.

array([0, 0, 0, ..., 0, 0, 1])

In [25]:
from sklearn.metrics import classification_report # Import the required classification_report function.
print("Classification Report")
cr = classification_report(y, y_pred) # Compare the true labels with the cross-validated predictions.
print(cr.split('macro avg')[0]) # Print the per-class part of the classification report for the KNearestNeighbors model.

Classification Report
              precision    recall  f1-score   support

           0       0.57      1.00      0.73      4342
           1       0.93      0.02      0.03      3271

    accuracy                           0.58      7613
   
