# Logistic Regression
If you are here, then you are coming from the [Data Exploration notebook](https://github.com/sarrab/DMML2020_COOP/blob/main/code/EDA.ipynb). In this notebook, we experiment with Logistic Regression (the best model) on raw and cleaned data, and we also perform feature selection.


In [None]:

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import collections  as mc
%load_ext autoreload
%autoreload 2
import pandas as pd 
import seaborn as sns
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
sns.set_style("white")

from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
import spacy

from spacy import displacy
import string
from spacy.lang.en.stop_words import STOP_WORDS
from spacy.lang.en import English
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# Loading 

In [None]:

df_train = pd.read_csv("https://raw.githubusercontent.com/sarrab/DMML2020_COOP/main/data/training_data.csv")
df_test = pd.read_csv("https://raw.githubusercontent.com/sarrab/DMML2020_COOP/main/data/test_data.csv")


In [None]:
df_train_copy = df_train.copy()
df_train_copy_1 = df_train.copy()
df_train_copy_2 = df_train.copy()
df_train_copy_3 = df_train.copy()
df_train_copy_4 = df_train.copy()
df_train_copy_5 = df_train.copy()
df_train_copy_6 = df_train.copy()
df_train_copy_7 = df_train.copy()
df_train_copy_8 = df_train.copy()
df_train_copy_9 = df_train.copy()

Let's use these copies of the dataset to try various things out.

In [None]:
df_train_copy_6.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6471 entries, 0 to 6470
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        6471 non-null   int64 
 1   keyword   6416 non-null   object
 2   location  4330 non-null   object
 3   text      6471 non-null   object
 4   target    6471 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 252.9+ KB


In [None]:

df_test_copy_1 = df_test
df_test_copy_2 = df_test.copy()

# I. Initial Classification Model: 

Our initial assumption is that the default predictors (keyword, location and especially text) are correlated with, or perhaps even causally related to, whether a tweet is real or not.

In [None]:
df_train_copy.tail(5)

Unnamed: 0,id,keyword,location,text,target
6466,4377,earthquake,ARGENTINA,#Earthquake #Sismo M 1.9 - 15km E of Anchorage...,1
6467,3408,derail,,@EmiiliexIrwin Totally agree.She is 23 and kno...,0
6468,9794,trapped,,Hollywood Movie About Trapped Miners Released ...,1
6469,10344,weapons,Beirut/Toronto,Friendly reminder that the only country to eve...,1
6470,1779,buildings%20on%20fire,,Buildings are on fire and they have time for a...,1


In [None]:
!python -m spacy download en

✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.6/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.6/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


## I.1. Raw Model: With Neither Dataframe Cleaning Nor Text Cleaning


In this section, we train the model on raw data, that is, without any cleaning at all.

### I.1.1. With One Predictor 'text'

In [None]:
# Select features
X_train = df_train_copy['text'].values # the features we want to analyze
y_train = df_train_copy['target'].values # the labels, or answers, we want to test against


X_test = df_test['text'].values # the features we want to analyze

Our intuition is that TF-IDF takes care of unimportant words, such as most stop words, by giving them low weights.
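To make that intuition concrete, here is a minimal sketch on a made-up three-sentence corpus. The sentences and the names `toy_corpus`, `idf_by_term` and `pruned` are illustrative assumptions, not taken from the dataset. A word that occurs in every document receives the lowest IDF weight, and `max_df=0.95` drops terms that occur in more than 95% of the documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical): "the" occurs in every document, "earthquake" in one.
toy_corpus = [
    "the earthquake shook the city",
    "the concert was the best",
    "the forest fire spread quickly",
]

toy_vectorizer = TfidfVectorizer(use_idf=True)
toy_vectorizer.fit(toy_corpus)

# idf_ holds one inverse-document-frequency weight per vocabulary term;
# vocabulary_ maps each term to its column index.
idf_by_term = {
    term: toy_vectorizer.idf_[index]
    for term, index in toy_vectorizer.vocabulary_.items()
}

# With max_df=0.95, terms present in more than 95% of documents are removed.
pruned = TfidfVectorizer(use_idf=True, max_df=0.95)
pruned.fit(toy_corpus)

print(idf_by_term["the"], idf_by_term["earthquake"])
print("the" in pruned.vocabulary_)
```

On a real corpus very few words exceed a 95% document frequency, so `max_df=0.95` mostly acts as a safety net while the IDF weights do the down-weighting.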

In [None]:
tfidf_vectorizer = TfidfVectorizer(use_idf = True, max_df = 0.95)

# Fit the vocabulary and IDF weights, then transform the training texts in one step
train_feature_set = tfidf_vectorizer.fit_transform(X_train)

In [None]:
classifier = LogisticRegression(solver='saga', max_iter=1000, random_state = 72)


# Create pipeline
pipe = Pipeline([('vectorizer', tfidf_vectorizer),
                 ('classifier', classifier)])

# Fit model on training set
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=0.95, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, inte

In [None]:
y_pred = pipe.predict(X_test)
y_pred

array([0, 0, 1, ..., 1, 0, 1])

In [None]:
print(f"ACCURACY SCORE:\n{pipe.score(X_train, y_train):.6f}")

ACCURACY SCORE:
0.888425


This model gives a good training accuracy on data that has not been sanitized, and it already outperforms the base rate.
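The training accuracy is measured on the same rows the model was fitted on, so it tends to be optimistic. The sketch below shows one way to compute the base rate and a held-out accuracy with `train_test_split`. The helper names `base_rate` and `holdout_accuracy`, the 80/20 split and the toy data are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline


def base_rate(labels):
    """Accuracy obtained by always predicting the most frequent class."""
    _, counts = np.unique(labels, return_counts=True)
    return counts.max() / counts.sum()


def holdout_accuracy(texts, labels, test_size=0.2, seed=72):
    """Fit TF-IDF + logistic regression on one part and score on the rest."""
    X_fit, X_val, y_fit, y_val = train_test_split(
        texts, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    model = Pipeline([
        ("vectorizer", TfidfVectorizer()),
        ("classifier", LogisticRegression(max_iter=1000)),
    ])
    model.fit(X_fit, y_fit)
    return model.score(X_val, y_val)


# Toy data (hypothetical), only to show the calls.
toy_texts = ["flood in the city", "wildfire near town",
             "great day out", "love this song"] * 5
toy_labels = [1, 1, 0, 0] * 5

toy_base_rate = base_rate(toy_labels)
toy_holdout = holdout_accuracy(toy_texts, toy_labels)
print(toy_base_rate, toy_holdout)
```

In this notebook the equivalent calls would be `base_rate(y_train)` and `holdout_accuracy(X_train, y_train)`.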


In [None]:
#Submission

target_pred = pd.DataFrame(y_pred, columns= ['target'])

target_pred.to_csv (r'./submission1_30Nov.csv', index = False, header=True)



### I.1.2 With Two predictors 'text' and 'keyword'

In [None]:
df_train_copy['Text With Keyword'] = df_train_copy['keyword'].str.cat(df_train_copy['text'], sep =" ") 
df_train_copy.drop(['text','keyword'],inplace=True, axis=1)

In [None]:
df_train_copy

Unnamed: 0,id,location,target,Text With Keyword
0,3738,USA,0,destroyed Black Eye 9: A space battle occurred...
1,853,,0,bioterror #world FedEx no longer to transport ...
2,10540,"Palm Beach County, FL",1,windstorm Reality Training: Train falls off el...
3,5988,USA,1,hazardous #Taiwan Grace: expect that large roc...
4,6328,Australia,1,hostage New ISIS Video: ISIS Threatens to Behe...
...,...,...,...,...
6466,4377,ARGENTINA,1,earthquake #Earthquake #Sismo M 1.9 - 15km E o...
6467,3408,,0,derail @EmiiliexIrwin Totally agree.She is 23 ...
6468,9794,,1,trapped Hollywood Movie About Trapped Miners R...
6469,10344,Beirut/Toronto,1,weapons Friendly reminder that the only countr...


In [None]:
X_train_prdc_2 = df_train_copy['Text With Keyword'].values # the features we want to analyze

X_train_prdc_2

array(['destroyed Black Eye 9: A space battle occurred at Star O784 involving 3 fleets totaling 3945 ships with 17 destroyed',
       'bioterror #world FedEx no longer to transport bioterror germs in wake of anthrax lab mishaps  http://t.co/5zDbTktwW7',
       'windstorm Reality Training: Train falls off elevated tracks during windstorm http://t.co/JIOMnrCygT #Paramedic #EMS',
       ...,
       "trapped Hollywood Movie About Trapped Miners Released in Chile: 'The 33' Hollywood movie about trapped miners starring... http://t.co/tyyfG4qQvM",
       'weapons Friendly reminder that the only country to ever use nuclear weapons is the U.S. And it was against civilians. https://t.co/7QrEPylLUK',
       'buildings%20on%20fire Buildings are on fire and they have time for a business meeting #TheStrain'],
      dtype=object)

In [None]:
tfidf_vectorizer_prd_2 = TfidfVectorizer(use_idf = True, max_df = 0.95)
tfidf_vectorizer_prd_2.fit_transform(X_train_prdc_2.astype('U'))

# astype('U') casts every entry to a unicode string, so a missing value (NaN) becomes the literal string 'nan'

tfidf_vectorizer_prd_2.transform(X_train_prdc_2.astype('U'))    

<6471x19316 sparse matrix of type '<class 'numpy.float64'>'
	with 95729 stored elements in Compressed Sparse Row format>

In [None]:
classifier_pred2 = LogisticRegression(solver='saga', max_iter=1000, random_state = 72)


# Create pipeline
pipe_prd_2 = Pipeline([('vectorizer', tfidf_vectorizer_prd_2),
                 ('classifier', classifier_pred2)])

# Fit model on training set
pipe_prd_2.fit(X_train_prdc_2.astype('U'), y_train)

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=0.95, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, inte

In [None]:
y_pred_2p = pipe_prd_2.predict(X_test)
y_pred_2p

array([0, 0, 1, ..., 1, 0, 1])

In [None]:
print(f"ACCURACY SCORE:\n{pipe_prd_2.score(X_train_prdc_2.astype('U'), y_train):.6f}")

ACCURACY SCORE:
0.882862


In [None]:
#Submission

target_pred = pd.DataFrame(y_pred_2p, columns= ['target'])

target_pred.to_csv (r'./submission1_4Dec.csv', index = False, header=True)



### I.1.3 With Two predictors 'text' and 'location'

In [None]:
df_train_copy_6['Text With location'] = df_train_copy_6['location'].str.cat(df_train_copy_6['text'], sep =" ") 
df_train_copy_6.drop(['text','location'],inplace=True, axis=1)

In [None]:
X_train_cl_p2_loc = df_train_copy_6['Text With location'].values
Y_train_cl_p2_loc = df_train_copy_6['target'].values
X_train_cl_p2_loc

array(['USA Black Eye 9: A space battle occurred at Star O784 involving 3 fleets totaling 3945 ships with 17 destroyed',
       nan,
       'Palm Beach County, FL Reality Training: Train falls off elevated tracks during windstorm http://t.co/JIOMnrCygT #Paramedic #EMS',
       ..., nan,
       'Beirut/Toronto Friendly reminder that the only country to ever use nuclear weapons is the U.S. And it was against civilians. https://t.co/7QrEPylLUK',
       nan], dtype=object)
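The `nan` entries above appear because `Series.str.cat` returns a missing value whenever either operand is missing. The later `astype('U')` then turns each of them into the literal string `'nan'`, so the tweet text is lost for those rows. Below is a minimal sketch of one alternative, assuming that a missing keyword or location should simply be treated as an empty string. The helper name `combine_text_columns` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def combine_text_columns(frame, columns, sep=" "):
    """Join several text columns row by row, treating missing values as ''."""
    filled = frame[columns].fillna("").astype(str)
    # Skip empty parts so that a missing value leaves no stray separator.
    return filled.apply(lambda row: sep.join(part for part in row if part), axis=1)


toy = pd.DataFrame({
    "location": ["USA", np.nan],
    "text": ["Black Eye 9: a space battle", "FedEx no longer to transport germs"],
})

# Original approach: the second row becomes NaN because its location is missing.
naive = toy["location"].str.cat(toy["text"], sep=" ")

# Sketch: the text survives even when the location is missing.
safe = combine_text_columns(toy, ["location", "text"])

print(naive.tolist())
print(safe.tolist())
```

The same helper could be applied to `df_test` so that `predict` receives inputs built the same way as those passed to `fit`.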

In [None]:
tfidf_vectorizer_prd_2_loc = TfidfVectorizer(use_idf = True, max_df = 0.95)
tfidf_vectorizer_prd_2_loc.fit_transform(X_train_cl_p2_loc.astype('U'))

# astype('U') casts every entry to a unicode string, so a missing value (NaN) becomes the literal string 'nan'

tfidf_vectorizer_prd_2_loc.transform(X_train_cl_p2_loc.astype('U'))    

<6471x16515 sparse matrix of type '<class 'numpy.float64'>'
	with 75320 stored elements in Compressed Sparse Row format>

In [None]:
classifier_pred2_loc = LogisticRegression(solver='saga', max_iter=1000, random_state = 72)


# Create pipeline
pipe_prd_2_loc = Pipeline([('vectorizer', tfidf_vectorizer_prd_2_loc),
                 ('classifier', classifier_pred2_loc)])

# Fit model on training set
pipe_prd_2_loc.fit(X_train_cl_p2_loc.astype('U'), Y_train_cl_p2_loc)

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=0.95, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, inte

In [None]:
y_pred_2p_loc = pipe_prd_2_loc.predict(X_test)
y_pred_2p_loc

array([0, 0, 1, ..., 1, 0, 0])

In [None]:
print(f"ACCURACY SCORE:\n{pipe_prd_2_loc.score(X_train_cl_p2_loc.astype('U'), y_train):.6f}")

ACCURACY SCORE:
0.789059


In [None]:
#No submission

### I.1.4 With Three predictors 'text', 'location' and 'keyword'


In [None]:
df_train_copy['Text With Keyword and Location'] = df_train_copy['location'].str.cat(df_train_copy['Text With Keyword'], sep =" ") 
df_train_copy.drop(['location','Text With Keyword'],inplace=True, axis=1)

In [None]:
df_train_copy

Unnamed: 0,id,target,Text With Keyword and Location
0,3738,0,USA destroyed Black Eye 9: A space battle occu...
1,853,0,
2,10540,1,"Palm Beach County, FL windstorm Reality Traini..."
3,5988,1,USA hazardous #Taiwan Grace: expect that large...
4,6328,1,Australia hostage New ISIS Video: ISIS Threat...
...,...,...,...
6466,4377,1,ARGENTINA earthquake #Earthquake #Sismo M 1.9 ...
6467,3408,0,
6468,9794,1,
6469,10344,1,Beirut/Toronto weapons Friendly reminder that ...


In [None]:
X_train_prdc_3 = df_train_copy['Text With Keyword and Location'].values # the features we want to analyze

X_train_prdc_3#.values

array(['USA destroyed Black Eye 9: A space battle occurred at Star O784 involving 3 fleets totaling 3945 ships with 17 destroyed',
       nan,
       'Palm Beach County, FL windstorm Reality Training: Train falls off elevated tracks during windstorm http://t.co/JIOMnrCygT #Paramedic #EMS',
       ..., nan,
       'Beirut/Toronto weapons Friendly reminder that the only country to ever use nuclear weapons is the U.S. And it was against civilians. https://t.co/7QrEPylLUK',
       nan], dtype=object)

In [None]:
df_train_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6471 entries, 0 to 6470
Data columns (total 3 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   id                              6471 non-null   int64 
 1   target                          6471 non-null   int64 
 2   Text With Keyword and Location  4330 non-null   object
dtypes: int64(2), object(1)
memory usage: 151.8+ KB


In [None]:
tfidf_vectorizer_prd_3 = TfidfVectorizer(use_idf = True, max_df = 0.95)
tfidf_vectorizer_prd_3.fit_transform(X_train_prdc_3.astype('U'))

# astype('U') casts every entry to a unicode string, so a missing value (NaN) becomes the literal string 'nan'

tfidf_vectorizer_prd_3.transform(X_train_prdc_3.astype('U'))    

<6471x16545 sparse matrix of type '<class 'numpy.float64'>'
	with 76370 stored elements in Compressed Sparse Row format>

In [None]:
classifier_pred3 = LogisticRegression(solver='saga', max_iter=1000, random_state = 72)


# Create pipeline
pipe_prd_3 = Pipeline([('vectorizer', tfidf_vectorizer_prd_3),
                 ('classifier', classifier_pred3)])

# Fit model on training set
pipe_prd_3.fit(X_train_prdc_3.astype('U'), y_train)

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=0.95, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, inte

In [None]:
y_pred_3p = pipe_prd_3.predict(X_test)
y_pred_3p

array([0, 0, 1, ..., 1, 0, 0])

In [None]:
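# Note: this call scores on X_train_prdc_2 (keyword + text), whereas the pipeline was
# fitted on X_train_prdc_3; pass X_train_prdc_3 to score on the fitted inputs.
print(f"ACCURACY SCORE:\n{pipe_prd_3.score(X_train_prdc_2.astype('U'), y_train):.6f}")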

ACCURACY SCORE:
0.849791


In [None]:
#Submission

target_pred = pd.DataFrame(y_pred_3p, columns= ['target'])

target_pred.to_csv (r'./submission2_4Dec.csv', index = False, header=True)


## I.2. Model With Text Cleaning but Without DataFrame Cleaning



### I.2.1. With One Predictor 'text'

#### Processing the tweets

We preprocess the tweets as follows:
- Remove the stop words, using the stop word list from the `spacy` package.
- Remove the punctuation marks, using the punctuation list from the `string` package.
- Lowercase all of the words.
- Lemmatize all of the words using the `spacy` package.

In [None]:
# Create a list of punctuation marks
punctuations = string.punctuation

# Create a list of stopwords
spacy_stopwords = spacy.lang.en.stop_words.STOP_WORDS

# Load English language model
sp = spacy.load('en')

# Create tokenizer function
def spacy_tokenizer(sentence):
    # Create a spaCy Doc object from the sentence
    mytokens = sp(sentence)

    # Lemmatize each token and convert each token into lowercase
    mytokens = [ word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens ]

    # Remove stop words and punctuation
    mytokens = [ word for word in mytokens if word not in spacy_stopwords and word not in punctuations ]
    
    return mytokens


# message = df_train_copy["text"]
# token_1 = spacy_tokenizer(str(message.values))
# token_1

#### TF-IDF feature vectors

Create the TF-IDF feature vectors for the processed tweets. These vectors are the features used to train the classifier.

In [None]:
tfidf_vector_1 = TfidfVectorizer(tokenizer=spacy_tokenizer) # we use the above defined tokenizer 
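How a custom tokenizer plugs into `TfidfVectorizer` can be sketched without spaCy. The stand-in below only lowercases, strips punctuation and removes a tiny hand-written stop list. It performs no lemmatization, so it is not equivalent to `spacy_tokenizer`. The names `TOY_STOP_WORDS` and `simple_tokenizer` and the two sentences are assumptions made for illustration.

```python
import string

from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in stop list (assumption); the notebook uses spaCy's STOP_WORDS.
TOY_STOP_WORDS = {"the", "a", "is", "on", "and"}


def simple_tokenizer(sentence):
    """Lowercase, strip surrounding punctuation, drop stop words."""
    tokens = [tok.strip(string.punctuation).lower() for tok in sentence.split()]
    return [tok for tok in tokens if tok and tok not in TOY_STOP_WORDS]


# token_pattern=None makes explicit that the default regex is not used
# once a tokenizer is supplied.
toy_tfidf = TfidfVectorizer(tokenizer=simple_tokenizer, token_pattern=None)
toy_matrix = toy_tfidf.fit_transform(
    ["The building is on fire!", "A fire and a flood."]
)

toy_vocabulary = sorted(toy_tfidf.vocabulary_)
print(toy_vocabulary)
print(toy_matrix.shape)
```

The vectorizer calls the tokenizer once per document, so whatever the function returns becomes the vocabulary. This is why the stop words and punctuation never reach the classifier.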

In [None]:
# Define classifier
classifier_1 = LogisticRegression(solver='saga', max_iter=1000, random_state = 72)

# Create pipeline
pipe_2 = Pipeline([('vectorizer', tfidf_vector_1),
                 ('classifier', classifier_1)])

# Fit model on training set
pipe_2.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_patt...b',
                                 tokenizer=<function spacy_tokenizer at 0x7f671c5e2488>,
                                 use_idf=True, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit

In [None]:
y_pred = pipe_2.predict(X_test)
y_pred

array([0, 0, 1, ..., 1, 0, 1])

#### Accuracy

The score below is the accuracy on the training set, not the test accuracy.

In [None]:
# Evaluate on training set
print(f"ACCURACY SCORE:\n{pipe_2.score(X_train, y_train):.6f}")

ACCURACY SCORE:
0.888734


We observe that the training accuracy improved only marginally after pre-processing (0.888734 versus 0.888425 on raw text).

#### We'll clean the data and redo the model training

In [None]:
#Submission 2

target_pred = pd.DataFrame(y_pred, columns= ['target'])

target_pred.to_csv (r'./submission2_30Nov.csv', index = False, header=True)

### I.2.2. With Two Predictors 'text' and 'keyword'

In [None]:
# Define classifier
classifier_1 = LogisticRegression(solver='saga', max_iter=1000, random_state = 72)

# Create pipeline
pipe_2 = Pipeline([('vectorizer', tfidf_vector_1),
                 ('classifier', classifier_1)])

# Fit model on training set
pipe_2.fit(X_train_prdc_2.astype('U'), y_train)

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_patt...b',
                                 tokenizer=<function spacy_tokenizer at 0x7f671c5e2488>,
                                 use_idf=True, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit

In [None]:
y_pred_p_p2 = pipe_2.predict(X_test)
y_pred_p_p2

array([0, 0, 1, ..., 1, 0, 1])

In [None]:
# Evaluate on training set
print(f"ACCURACY SCORE:\n{pipe_2.score(X_train_prdc_2.astype('U'), y_train):.6f}")

ACCURACY SCORE:
0.881626


In [None]:
#Submission 2

target_pred = pd.DataFrame(y_pred_p_p2, columns= ['target'])

target_pred.to_csv (r'./submission3_4Dec.csv', index = False, header=True)

### I.2.3. With Two Predictors 'text' and 'location'

In [None]:
# Define classifier
classifier_1 = LogisticRegression(solver='saga', max_iter=1000, random_state = 72)

# Create pipeline
pipe_2 = Pipeline([('vectorizer', tfidf_vector_1),
                 ('classifier', classifier_1)])

# Fit model on training set
pipe_2.fit(X_train_cl_p2_loc.astype('U'), y_train)

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_patt...b',
                                 tokenizer=<function spacy_tokenizer at 0x7f671c5e2488>,
                                 use_idf=True, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit

In [None]:
y_pred_p_p2 = pipe_2.predict(X_test)
y_pred_p_p2

array([0, 0, 1, ..., 1, 0, 1])

In [None]:
# Evaluate on training set
print(f"ACCURACY SCORE:\n{pipe_2.score(X_train_cl_p2_loc.astype('U'), y_train):.6f}")

ACCURACY SCORE:
0.790450


In [None]:
# Submission

### I.2.4. With Three Predictors 'text', 'location' and 'keyword'

In [None]:
# Define classifier
classifier_1 = LogisticRegression(solver='saga', max_iter=1000, random_state = 72)

# Create pipeline
pipe_2 = Pipeline([('vectorizer', tfidf_vector_1),
                 ('classifier', classifier_1)])

# Fit model on training set
pipe_2.fit(X_train_prdc_3.astype('U'), y_train)

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_patt...b',
                                 tokenizer=<function spacy_tokenizer at 0x7f671c5e2488>,
                                 use_idf=True, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit

In [None]:
y_pred_p_p3 = pipe_2.predict(X_test)
y_pred_p_p3

array([0, 0, 1, ..., 1, 0, 1])

In [None]:
# Evaluate on training set
print(f"ACCURACY SCORE:\n{pipe_2.score(X_train_prdc_3.astype('U'), y_train):.6f}")

ACCURACY SCORE:
0.785659


In [None]:
# Drastic drop in accuracy when the predictors are not sanitized, probably because rows
# with a missing location become NaN after concatenation. No submission.

# Cleaning 

In [None]:
df_train_copy_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6471 entries, 0 to 6470
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        6471 non-null   int64 
 1   keyword   6416 non-null   object
 2   location  4330 non-null   object
 3   text      6471 non-null   object
 4   target    6471 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 252.9+ KB


### Let's start with the duplicated values 

In [None]:
dupli=df_train_copy_1[df_train_copy_1[['keyword','location','text']].duplicated() == True]
dupli 

Unnamed: 0,id,keyword,location,text,target
790,6097,hellfire,,The Prophet (peace be upon him) said 'Save you...,1
978,4072,displaced,Pedophile hunting ground,.POTUS #StrategicPatience is a strategy for #G...,1
1455,9225,suicide%20bombing,,'Suicide bombing at [location named]...' #prem...,1
1481,9135,suicide%20bomb,Nigeria,#Bestnaijamade: 16yr old PKK suicide bomber wh...,1
1492,4077,displaced,Pedophile hunting ground,.POTUS #StrategicPatience is a strategy for #G...,1
1737,4691,engulfed,,He came to a land which was engulfed in tribal...,0
1789,5157,fatal,,11-Year-Old Boy Charged With Manslaughter of T...,1
1822,6132,hellfire,,The Prophet (peace be upon him) said 'Save you...,1
1946,5113,fatal,,11-Year-Old Boy Charged With Manslaughter of T...,1
2212,5140,fatal,,11-Year-Old Boy Charged With Manslaughter of T...,1


#### We observe that there are duplicate entries

In [None]:
df_train_copy_1.drop([790,978,1455,1481,1492,1737,1789,1822,1946,2212,2363,2597,2628,2658,2941,3076,3153,3255,3275,3340,3359,3435,3541,3575,3673,3852,3890,3891,3938,3960,3976,3988,4037,4061,4111,4119,4451,4548,4592,4632,4735,4890,4908,6099,6163,6272,6440], inplace = True)

We have dropped these duplicated entries. We are left with 6424 rows.
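The hard-coded index list can also be derived from the data. The sketch below uses a small made-up frame (the rows and the name `key_columns` are assumptions) to show `duplicated` and `drop_duplicates` on the same three columns. Missing values are treated as equal to each other when rows are compared.

```python
import pandas as pd

# Toy frame (hypothetical) with one row repeated twice.
toy = pd.DataFrame({
    "keyword": ["fatal", "fatal", "hellfire", "fatal"],
    "location": [None, None, "USA", None],
    "text": ["boy charged", "boy charged", "save yourself", "boy charged"],
    "target": [1, 1, 1, 1],
})

key_columns = ["keyword", "location", "text"]

# duplicated() marks every repeat after the first occurrence.
n_duplicates = int(toy.duplicated(subset=key_columns).sum())

# drop_duplicates keeps the first occurrence and preserves the original index.
deduplicated = toy.drop_duplicates(subset=key_columns, keep="first")

print(n_duplicates)
print(list(deduplicated.index))
```

Applied to the training frame, `df_train_copy_1.drop_duplicates(subset=['keyword', 'location', 'text'])` would remove the rows flagged by the `duplicated()` call above without listing the indices by hand.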

In [None]:
#keeping location
df_train_copy_2.drop([790,978,1455,1481,1492,1737,1789,1822,1946,2212,2363,2597,2628,2658,2941,3076,3153,3255,3275,3340,3359,3435,3541,3575,3673,3852,3890,3891,3938,3960,3976,3988,4037,4061,4111,4119,4451,4548,4592,4632,4735,4890,4908,6099,6163,6272,6440], inplace = True)

In [None]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6471 entries, 0 to 6470
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        6471 non-null   int64 
 1   keyword   6416 non-null   object
 2   location  4330 non-null   object
 3   text      6471 non-null   object
 4   target    6471 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 252.9+ KB


In [None]:
df_train_copy_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6424 entries, 0 to 6470
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        6424 non-null   int64 
 1   keyword   6369 non-null   object
 2   location  4312 non-null   object
 3   text      6424 non-null   object
 4   target    6424 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 301.1+ KB


In [None]:
df_train_copy_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6424 entries, 0 to 6470
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        6424 non-null   int64 
 1   keyword   6369 non-null   object
 2   location  4312 non-null   object
 3   text      6424 non-null   object
 4   target    6424 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 301.1+ KB


#### We remark that keyword and location have some missing values.
Location:
- About one third of the locations are missing.
- Many of the locations are, as per our findings in the EDA, pseudonyms.
- Thus, we drop this column.
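The share of missing values per column can be read off directly rather than from the `info()` counts. This is a minimal sketch on a made-up frame (the rows are assumptions): the mean of the boolean `isnull()` frame is the missing fraction of each column.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical): 2 of 6 locations and 1 of 6 keywords are missing.
toy = pd.DataFrame({
    "keyword": ["fatal", "flood", np.nan, "storm", "fire", "quake"],
    "location": ["USA", np.nan, np.nan, "Paris", "Lagos", "Tokyo"],
    "text": ["a", "b", "c", "d", "e", "f"],
})

# isnull() gives a boolean frame; the column mean is the missing fraction.
missing_fraction = toy.isnull().mean()
print(missing_fraction)
```

In the training data, 4330 of 6471 locations are non-null before deduplication, so 2141 locations (about 33%) are missing.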

In [None]:
df_train_copy_1.drop('location', axis = 1, inplace = True)

In [None]:
df_train_copy_1

Unnamed: 0,id,keyword,text,target
0,3738,destroyed,Black Eye 9: A space battle occurred at Star O...,0
1,853,bioterror,#world FedEx no longer to transport bioterror ...,0
2,10540,windstorm,Reality Training: Train falls off elevated tra...,1
3,5988,hazardous,#Taiwan Grace: expect that large rocks trees m...,1
4,6328,hostage,New ISIS Video: ISIS Threatens to Behead Croat...,1
...,...,...,...,...
6466,4377,earthquake,#Earthquake #Sismo M 1.9 - 15km E of Anchorage...,1
6467,3408,derail,@EmiiliexIrwin Totally agree.She is 23 and kno...,0
6468,9794,trapped,Hollywood Movie About Trapped Miners Released ...,1
6469,10344,weapons,Friendly reminder that the only country to eve...,1


#### And of course, it is obvious that `id` is not of much importance for us.

In [None]:
df_train_copy_1.drop('id', axis = 1, inplace = True) #useless feature

In [None]:
#with location
df_train_copy_2.drop('id', axis = 1, inplace = True) #useless feature

We, therefore, dropped it.

In [None]:
df_train_copy_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6424 entries, 0 to 6470
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   keyword  6369 non-null   object
 1   text     6424 non-null   object
 2   target   6424 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 200.8+ KB


We observe that the keyword feature is missing some values.

### Let's deal with missing values

In [None]:
df_train_copy_1['keyword'].isnull().sum()

55

In [None]:
df_train_copy_1[df_train_copy_1['keyword'].isnull() == True]

Unnamed: 0,keyword,text,target
46,,#breaking #LA Refugio oil spill may have been ...,1
224,,RT @LivingSafely: #NWS issues Severe #Thunders...,1
263,,I'm on top of the hill and I can see a fire in...,1
563,,"13,000 people receive #wildfires evacuation or...",1
892,,Three people died from the heat wave so far,1
1020,,What a wonderful day!,0
1101,,Pic of 16yr old PKK suicide bomber who detonat...,1
1231,,#flood #disaster Heavy rain causes flash flood...,1
1281,,#raining #flooding #Florida #TampaBay #Tampa 1...,1
1285,,#Sismo DETECTADO #JapÌ_n 15:41:07 Seismic inte...,1


In [None]:
df_train_copy_to_remove = df_train_copy_1[df_train_copy_1['keyword'].isnull()]
index_to_remove=df_train_copy_to_remove.index
index_to_remove

Int64Index([  46,  224,  263,  563,  892, 1020, 1101, 1231, 1281, 1285, 1544,
            1606, 1776, 1890, 1892, 2029, 2082, 2215, 2485, 2594, 2624, 2631,
            2681, 2720, 3281, 3351, 3503, 3707, 3723, 3739, 3925, 3990, 4146,
            4315, 4453, 4588, 4688, 4793, 4854, 4989, 5536, 5623, 5660, 5771,
            5929, 5937, 5997, 6006, 6103, 6153, 6236, 6310, 6315, 6407, 6442],
           dtype='int64')

In [None]:
df_train_copy_1.drop(index_to_remove, inplace = True)
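The two-step removal above (collect the index of the null rows, then drop it) has a one-step equivalent in `dropna`. The sketch below compares both on a made-up frame. The rows and the names `by_index` and `by_dropna` are assumptions.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical) with one missing keyword.
toy = pd.DataFrame({
    "keyword": ["fatal", np.nan, "storm"],
    "text": ["boy charged", "what a wonderful day", "storm blitzes city"],
    "target": [1, 0, 1],
})

# Index-based removal, as in the cells above.
index_to_remove = toy[toy["keyword"].isnull()].index
by_index = toy.drop(index_to_remove)

# Equivalent one-step form: drop rows whose keyword is missing.
by_dropna = toy.dropna(subset=["keyword"])

print(by_index.equals(by_dropna))
```

Both forms return a new frame here. Passing `inplace=True`, as the notebook does, modifies the frame itself instead.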

In [None]:
# Dealing with missing locations
df_train_copy_2[df_train_copy_2['location'].isnull() == True]

Unnamed: 0,keyword,location,text,target
1,bioterror,,#world FedEx no longer to transport bioterror ...,0
9,thunder,,The thunder shook my house woke my sister and ...,1
14,seismic,,Thanks Benson &amp; Clegg for the #follow! Che...,0
15,bridge%20collapse,,Mexico: construction of bridge collapse killså...,1
20,violent%20storm,,Storm blitzes Traverse City disrupts Managemen...,1
...,...,...,...,...
6462,epicentre,,Epicentre - Cydia Tweak - https://t.co/WKmfDig...,0
6463,sinkhole,,150-Foot Sinkhole Opens in Lowndes County Resi...,1
6467,derail,,@EmiiliexIrwin Totally agree.She is 23 and kno...,0
6468,trapped,,Hollywood Movie About Trapped Miners Released ...,1


In [None]:
df_train_copy_to_remove_loc = df_train_copy_2[df_train_copy_2['location'].isnull()]
index_to_remove_loc = df_train_copy_to_remove_loc.index
index_to_remove_loc

Int64Index([   1,    9,   14,   15,   20,   21,   24,   25,   28,   29,
            ...
            6453, 6454, 6456, 6460, 6461, 6462, 6463, 6467, 6468, 6470],
           dtype='int64', length=2112)

In [None]:
df_train_copy_2.drop(index_to_remove_loc, inplace = True)

In [None]:
df_train_copy_2['keyword'].isnull().sum()

0

In [None]:
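# Note: plain assignment does not copy. These three names all refer to the same
# object as df_train, so the in-place drops below also change df_train.
# Use df_train.copy() instead if independent frames are intended.
df_train_copy_3 = df_train
df_train_copy_4 = df_train
df_train_copy_5 = df_train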
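
Assigning a DataFrame to a second name does not create a second DataFrame, which matters as soon as `inplace=True` operations are involved. A small sketch of the difference between an alias and a copy (the `original` frame is a hypothetical stand-in for `df_train`):

```python
import pandas as pd

# Hypothetical toy frame standing in for df_train.
original = pd.DataFrame({"text": ["a", "b", "c"], "target": [0, 1, 0]})

alias = original               # same object, second name
independent = original.copy()  # separate object with its own data

alias.drop([0], inplace=True)  # also removes row 0 from `original`

print(alias is original)
print(len(original), len(independent))
```

After the in-place drop, `original` has lost a row while `independent` still holds all three.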

In [None]:
#keeping location
df_train_copy_3.drop([790,978,1455,1481,1492,1737,1789,1822,1946,2212,2363,2597,2628,2658,2941,3076,3153,3255,3275,3340,3359,3435,3541,3575,3673,3852,3890,3891,3938,3960,3976,3988,4037,4061,4111,4119,4451,4548,4592,4632,4735,4890,4908,6099,6163,6272,6440], inplace = True)

In [None]:
# Drop the rows with a missing location
# df_train_copy_3[df_train_copy_3['location'].isnull() == True]
df_train_copy_3.drop(index_to_remove_loc, inplace = True)

In [None]:
clean_pos = df_train_copy_3.copy()

In [None]:
df_train_copy_3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4312 entries, 0 to 6469
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        4312 non-null   int64 
 1   keyword   4312 non-null   object
 2   location  4312 non-null   object
 3   text      4312 non-null   object
 4   target    4312 non-null   int64 
dtypes: int64(2), object(3)
memory usage: 202.1+ KB


In [None]:
df_train_clean_copy = df_train_copy_1.copy()
df_train_clean_copy.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6369 entries, 0 to 6470
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   keyword  6369 non-null   object
 1   text     6369 non-null   object
 2   target   6369 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 199.0+ KB


#### Merging text with keyword

In [None]:
df_train_copy_1['Text With Keyword'] = df_train_copy_1['keyword'].str.cat(df_train_copy_1['text'], sep =" ") 
df_train_copy_1.drop(['text','keyword'],inplace=True, axis=1)

In [None]:
df_train_copy_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6369 entries, 0 to 6470
Data columns (total 2 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   target             6369 non-null   int64 
 1   Text With Keyword  6369 non-null   object
dtypes: int64(1), object(1)
memory usage: 149.3+ KB
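
Dropping the rows with a missing `keyword` before merging matters because `Series.str.cat` propagates missing values by default. A short sketch of that behaviour and of the `na_rep` parameter, on made-up values:

```python
import numpy as np
import pandas as pd

keyword = pd.Series(["flood", np.nan])
text = pd.Series(["river is rising", "what a wonderful day"])

# Default behaviour: a missing value on either side makes the result missing.
merged_default = keyword.str.cat(text, sep=" ")

# With na_rep, missing values are replaced before concatenation.
merged_filled = keyword.str.cat(text, sep=" ", na_rep="")

print(merged_default.tolist())
print(merged_filled.tolist())
```

With `na_rep=""` the separator is still inserted, so the second row starts with a space; stripping the result afterwards is one way to tidy that up.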


#### Merging text with keyword and location


In [None]:
df_train_copy_2.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4312 entries, 0 to 6469
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   keyword   4312 non-null   object
 1   location  4312 non-null   object
 2   text      4312 non-null   object
 3   target    4312 non-null   int64 
dtypes: int64(1), object(3)
memory usage: 168.4+ KB


In [None]:
df_train_copy_2['Text With Keyword'] = df_train_copy_2['keyword'].str.cat(df_train_copy_2['text'], sep =" ") 
# df_train_copy_1.drop(['text','keyword'],inplace=True, axis=1)
df_train_copy_2['Text With Keyword and Location'] = df_train_copy_2['location'].str.cat(df_train_copy_2['Text With Keyword'], sep =" ") 
df_train_copy_2.drop(['text','keyword','location','Text With Keyword'],inplace=True, axis=1)

In [None]:
df_train_copy_2.info() #ready to be used

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4312 entries, 0 to 6469
Data columns (total 2 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   target                          4312 non-null   int64 
 1   Text With Keyword and Location  4312 non-null   object
dtypes: int64(1), object(1)
memory usage: 101.1+ KB


#### Merging location and text 

In [None]:
df_train_copy_3['Text With location'] = df_train_copy_3['location'].str.cat(df_train_copy_3['text'], sep =" ") 
# # df_train_copy_1.drop(['text','keyword'],inplace=True, axis=1)
# df_train_copy_3['Text With Keyword and Location'] = df_train_copy_3['location'].str.cat(df_train_copy_2['Text With Keyword'], sep =" ") 
# df_train_copy_3.drop(['text','keyword','location','Text With Keyword'],inplace=True, axis=1)
df_train_copy_3

Unnamed: 0,id,keyword,location,text,target,Text With location
0,3738,destroyed,USA,Black Eye 9: A space battle occurred at Star O...,0,USA Black Eye 9: A space battle occurred at St...
2,10540,windstorm,"Palm Beach County, FL",Reality Training: Train falls off elevated tra...,1,"Palm Beach County, FL Reality Training: Train ..."
3,5988,hazardous,USA,#Taiwan Grace: expect that large rocks trees m...,1,USA #Taiwan Grace: expect that large rocks tre...
4,6328,hostage,Australia,New ISIS Video: ISIS Threatens to Behead Croat...,1,Australia New ISIS Video: ISIS Threatens to B...
5,6669,landslide,Scotland,FreeBesieged: .MartinMJ22 YouGov Which '#Tory ...,1,Scotland FreeBesieged: .MartinMJ22 YouGov Whic...
...,...,...,...,...,...,...
6459,4472,electrocuted,"South West, England",MT @Earths_Voice Treatment of #tigers in #Chin...,1,"South West, England MT @Earths_Voice Treatment..."
6464,5310,fear,USA,Daily Reflections\r\nAugust 6\r\nDRIVEN\r\nDri...,0,USA Daily Reflections\r\nAugust 6\r\nDRIVEN\r\...
6465,8330,rubble,"Calgary, AB, Canada",China's Stock Market Crash: Are There Gems In ...,1,"Calgary, AB, Canada China's Stock Market Crash..."
6466,4377,earthquake,ARGENTINA,#Earthquake #Sismo M 1.9 - 15km E of Anchorage...,1,ARGENTINA #Earthquake #Sismo M 1.9 - 15km E of...


In [None]:
df3_copy = df_train_copy_3.copy()
# df_train_copy_3.drop(['id','text','keyword','location'],inplace=True, axis=1)
df_train_copy_3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4312 entries, 0 to 6469
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   id                  4312 non-null   int64 
 1   keyword             4312 non-null   object
 2   location            4312 non-null   object
 3   text                4312 non-null   object
 4   target              4312 non-null   int64 
 5   Text With location  4312 non-null   object
dtypes: int64(2), object(4)
memory usage: 235.8+ KB


## 3. Model With Text Cleaning And DataFrame Cleaning

The models below are built on data that has gone through both DataFrame cleaning (for example, removing rows with missing values) and text cleaning.

### 3.1. Non Balanced

In [None]:
df_train_clean_copy

Unnamed: 0,keyword,text,target
0,destroyed,Black Eye 9: A space battle occurred at Star O...,0
1,bioterror,#world FedEx no longer to transport bioterror ...,0
2,windstorm,Reality Training: Train falls off elevated tra...,1
3,hazardous,#Taiwan Grace: expect that large rocks trees m...,1
4,hostage,New ISIS Video: ISIS Threatens to Behead Croat...,1
...,...,...,...
6466,earthquake,#Earthquake #Sismo M 1.9 - 15km E of Anchorage...,1
6467,derail,@EmiiliexIrwin Totally agree.She is 23 and kno...,0
6468,trapped,Hollywood Movie About Trapped Miners Released ...,1
6469,weapons,Friendly reminder that the only country to eve...,1


#### Single Predictor

In [None]:
# Select features
X_train_cl_1 = df_train_clean_copy['text'].values # the features we want to analyze
Y_train_cl_1 = df_train_clean_copy['target'].values
X_train_cl_1

array(['Black Eye 9: A space battle occurred at Star O784 involving 3 fleets totaling 3945 ships with 17 destroyed',
       '#world FedEx no longer to transport bioterror germs in wake of anthrax lab mishaps  http://t.co/5zDbTktwW7',
       'Reality Training: Train falls off elevated tracks during windstorm http://t.co/JIOMnrCygT #Paramedic #EMS',
       ...,
       "Hollywood Movie About Trapped Miners Released in Chile: 'The 33' Hollywood movie about trapped miners starring... http://t.co/tyyfG4qQvM",
       'Friendly reminder that the only country to ever use nuclear weapons is the U.S. And it was against civilians. https://t.co/7QrEPylLUK',
       'Buildings are on fire and they have time for a business meeting #TheStrain'],
      dtype=object)

In [None]:
Y_train_cl_1

array([0, 0, 1, ..., 1, 1, 1])

### PreProcessing 

#### TF-IDF feature vectors

In [None]:
tfidf_vector_2 = TfidfVectorizer(tokenizer=spacy_tokenizer) # we use the above defined tokenizer 

In [None]:
# Define classifier
classifier_2 = LogisticRegression(solver='saga', max_iter=1000, random_state = 72)

# Create pipeline
pipe_3 = Pipeline([('vectorizer', tfidf_vector_2),
                 ('classifier', classifier_2)])

# Fit model on training set
pipe_3.fit(X_train_cl_1, Y_train_cl_1)

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_patt...b',
                                 tokenizer=<function spacy_tokenizer at 0x7f671c5e2488>,
                                 use_idf=True, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit

In [None]:
y_pred_3 = pipe_3.predict(X_test)
y_pred_3

array([0, 0, 1, ..., 1, 0, 1])

#### Accuracy

- What is the training accuracy of the classifier? (The cell below scores the pipeline on the data it was fitted on, not on the test set.)

In [None]:
# Evaluate on training set
print(f"ACCURACY SCORE:\n{pipe_3.score(X_train_cl_1, Y_train_cl_1):.6f}")
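
A score computed on the rows the pipeline was fitted on tends to be optimistic. One common way to get a less biased estimate is to hold out part of the labelled data. The sketch below is an illustration only: the toy corpus is made up, and the default `TfidfVectorizer` tokenizer is used in place of the notebook's `spacy_tokenizer`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus; the notebook would use X_train_cl_1 / Y_train_cl_1.
texts = [
    "forest fire spreading near the town", "earthquake damages several buildings",
    "flood waters rising after heavy rain", "storm destroys homes on the coast",
    "wildfire evacuation ordered tonight", "bridge collapse injures commuters",
    "i love this song so much", "what a wonderful sunny day",
    "my phone battery died again", "watching a movie with friends",
    "this pizza is absolutely amazing", "new shoes arrived this morning",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Keep 25% of the rows aside; stratify so both classes appear in each split.
X_fit, X_val, y_fit, y_val = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=72
)

sketch_pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),  # default tokenizer, not spacy_tokenizer
    ("classifier", LogisticRegression(solver="saga", max_iter=1000, random_state=72)),
])
sketch_pipe.fit(X_fit, y_fit)

train_acc = accuracy_score(y_fit, sketch_pipe.predict(X_fit))
val_acc = accuracy_score(y_val, sketch_pipe.predict(X_val))
print(f"training accuracy:   {train_acc:.3f}")
print(f"validation accuracy: {val_acc:.3f}")
```

Comparing the two numbers gives a rough idea of how much the training score overstates performance on unseen tweets.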


#### There is a slight improvement after cleaning

In [None]:
#Submission

target_pred = pd.DataFrame(y_pred_3, columns= ['target'])

target_pred.to_csv (r'./submission3_30Nov.csv', index = False, header=True)
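
The submission cell is repeated for every experiment in this notebook, so a small helper can reduce copy-and-paste slips. The function below is a hypothetical convenience wrapper, not part of the original notebook; the optional `ids` argument assumes an identifier per test row is available, which this section does not confirm.

```python
import os
import tempfile

import numpy as np
import pandas as pd

def write_submission(predictions, path, ids=None):
    """Write predictions to CSV; hypothetical helper, not part of the notebook."""
    frame = pd.DataFrame({"target": np.asarray(predictions)})
    if ids is not None:
        # Optional: put an identifier column first so rows can be matched later.
        frame.insert(0, "id", np.asarray(ids))
    frame.to_csv(path, index=False, header=True)
    return frame

# Usage with made-up predictions and ids, written to a temporary directory.
demo_path = os.path.join(tempfile.mkdtemp(), "submission_demo.csv")
write_submission([0, 1, 1], demo_path, ids=[10, 11, 12])
round_trip = pd.read_csv(demo_path)
print(round_trip)
```

Called without `ids`, the helper writes a single `target` column, which matches what the cells in this notebook produce.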


### Two Predictors: Text and Keyword

In [None]:
X_train_cl_p2 = df_train_copy_1['Text With Keyword'].values
Y_train_cl_p2 = df_train_copy_1['target'].values
X_train_cl_p2

In [None]:
Y_train_cl_p2

In [None]:
# Refit the pipeline on the merged keyword + text feature

pipe_3.fit(X_train_cl_p2, Y_train_cl_p2)

In [None]:
y_pred_3 = pipe_3.predict(X_test)
y_pred_3

In [None]:
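# Evaluate on training set (score the pipeline that was just refitted)
print(f"ACCURACY SCORE:\n{pipe_3.score(X_train_cl_p2, Y_train_cl_p2):.6f}")

ACCURACY SCORE:
0.846601


In [None]:
# The 0.846601 recorded above came from a run that scored pipe_2 rather than the refitted pipe_3,
# so the apparent drop with two predictors should be re-checked before drawing conclusions.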
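
Because one pipeline object is refitted for each feature combination, a score is only meaningful when it is computed on that same object right after the matching fit. One way to keep experiments apart is to clone an unfitted template per variant with `sklearn.base.clone`. This is a sketch under assumptions: the toy documents are made up and the default tokenizer stands in for `spacy_tokenizer`.

```python
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Unfitted template; the default tokenizer stands in for spacy_tokenizer.
template = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression(solver="saga", max_iter=1000, random_state=72)),
])

# Hypothetical feature variants for the same four rows.
labels = [1, 1, 0, 0]
variants = {
    "text": ["fire in the hills", "flood on main street",
             "lovely day", "great movie"],
    "keyword + text": ["fire fire in the hills", "flood flood on main street",
                       "sunny lovely day", "crush great movie"],
}

fitted = {}
scores = {}
for name, docs in variants.items():
    model = clone(template)   # fresh, unfitted copy with the same settings
    model.fit(docs, labels)
    fitted[name] = model
    scores[name] = model.score(docs, labels)  # training accuracy for this variant

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Each entry in `fitted` is an independent model, so scoring or predicting with one variant cannot silently use another variant's vocabulary.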

### Two Predictors: Text and location

In [None]:
df_train_copy_3.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4312 entries, 0 to 6469
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   id                  4312 non-null   int64 
 1   keyword             4312 non-null   object
 2   location            4312 non-null   object
 3   text                4312 non-null   object
 4   target              4312 non-null   int64 
 5   Text With location  4312 non-null   object
dtypes: int64(2), object(4)
memory usage: 235.8+ KB


In [None]:
X_train_cl_p2_loc = df_train_copy_3['Text With location'].values
Y_train_cl_p2_loc = df_train_copy_3['target'].values
X_train_cl_p2_loc

array(['USA Black Eye 9: A space battle occurred at Star O784 involving 3 fleets totaling 3945 ships with 17 destroyed',
       'Palm Beach County, FL Reality Training: Train falls off elevated tracks during windstorm http://t.co/JIOMnrCygT #Paramedic #EMS',
       'USA #Taiwan Grace: expect that large rocks trees mud unstable and/or saturated land may slide ..very hazardous in hilly/mountain areas...',
       ...,
       "Calgary, AB, Canada China's Stock Market Crash: Are There Gems In The Rubble? http://t.co/BqBLWiw08g #ROIMentor #yycwalks",
       'ARGENTINA #Earthquake #Sismo M 1.9 - 15km E of Anchorage Alaska: Time2015-08-06 00:11:16 UTC2015-08-05 16:11:16 -08:00 ... http://t.co/Z0VeR1hVM9',
       'Beirut/Toronto Friendly reminder that the only country to ever use nuclear weapons is the U.S. And it was against civilians. https://t.co/7QrEPylLUK'],
      dtype=object)

In [None]:

pipe_3.fit(X_train_cl_p2_loc, Y_train_cl_p2_loc)

In [None]:
y_pred_3 = pipe_3.predict(X_test)
y_pred_3

array([0, 0, 1, ..., 1, 0, 1])

In [None]:
# Evaluate on training set
print(f"ACCURACY SCORE:\n{pipe_3.score(X_train_cl_p2_loc, Y_train_cl_p2_loc):.6f}")

ACCURACY SCORE:
0.897263
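
The model in this subsection is fitted on location plus text, while predictions are made on `X_test`, whose definition lies outside this section. If `X_test` holds the tweet text alone, training and test inputs are built differently. The sketch below shows one way to build the combined string identically for any frame; `combine_columns` and both toy frames are hypothetical, and missing fields are filled with empty strings because test rows cannot simply be dropped.

```python
import numpy as np
import pandas as pd

def combine_columns(frame, columns, sep=" "):
    """Join the given text columns row by row, treating missing values as empty."""
    parts = [frame[col].fillna("").astype(str) for col in columns]
    combined = parts[0]
    for part in parts[1:]:
        combined = combined.str.cat(part, sep=sep)
    # Trim the leading/trailing separator left by an empty first or last field.
    return combined.str.strip()

toy_train = pd.DataFrame({
    "location": ["USA", np.nan],
    "text": ["space battle", "thunder shook my house"],
})
toy_test = pd.DataFrame({
    "location": [np.nan, "Scotland"],
    "text": ["fire downtown", "landslide warning"],
})

X_toy_train = combine_columns(toy_train, ["location", "text"]).values
X_toy_test = combine_columns(toy_test, ["location", "text"]).values
print(X_toy_train)
print(X_toy_test)
```

Applying the same function to the training and test frames keeps the vectorizer's input consistent between `fit` and `predict`.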


In [None]:
#Submission

target_pred = pd.DataFrame(y_pred_3, columns= ['target'])

target_pred.to_csv (r'./submission4_4Dec.csv', index = False, header=True)



### Three Predictors

In [None]:
X_train_cl_p3_loc = df_train_copy_2['Text With Keyword and Location'].values
Y_train_cl_p3_loc = df_train_copy_2['target'].values
X_train_cl_p3_loc

array(['USA destroyed Black Eye 9: A space battle occurred at Star O784 involving 3 fleets totaling 3945 ships with 17 destroyed',
       'Palm Beach County, FL windstorm Reality Training: Train falls off elevated tracks during windstorm http://t.co/JIOMnrCygT #Paramedic #EMS',
       'USA hazardous #Taiwan Grace: expect that large rocks trees mud unstable and/or saturated land may slide ..very hazardous in hilly/mountain areas...',
       ...,
       "Calgary, AB, Canada rubble China's Stock Market Crash: Are There Gems In The Rubble? http://t.co/BqBLWiw08g #ROIMentor #yycwalks",
       'ARGENTINA earthquake #Earthquake #Sismo M 1.9 - 15km E of Anchorage Alaska: Time2015-08-06 00:11:16 UTC2015-08-05 16:11:16 -08:00 ... http://t.co/Z0VeR1hVM9',
       'Beirut/Toronto weapons Friendly reminder that the only country to ever use nuclear weapons is the U.S. And it was against civilians. https://t.co/7QrEPylLUK'],
      dtype=object)

In [None]:
Y_train_cl_p3_loc

array([0, 1, 1, ..., 1, 1, 1])

In [None]:
# Refit the pipeline on the merged location + keyword + text feature

pipe_3.fit(X_train_cl_p3_loc, Y_train_cl_p3_loc)

In [None]:
y_pred_cl_3 = pipe_3.predict(X_test)
y_pred_cl_3

array([0, 0, 1, ..., 1, 0, 1])

In [None]:
# Evaluate on training set
print(f"ACCURACY SCORE:\n{pipe_3.score(X_train_cl_p3_loc, Y_train_cl_p3_loc):.6f}")

ACCURACY SCORE:
0.891698


In [None]:
#Submission

target_pred = pd.DataFrame(y_pred_cl_3, columns= ['target'])

target_pred.to_csv (r'./submission5_4Dec.csv', index = False, header=True)


### 3.2. Balanced DataFrame

#### Balancing on text

In [None]:
df_train_clean_copy

Unnamed: 0,keyword,text,target
0,destroyed,Black Eye 9: A space battle occurred at Star O...,0
1,bioterror,#world FedEx no longer to transport bioterror ...,0
2,windstorm,Reality Training: Train falls off elevated tra...,1
3,hazardous,#Taiwan Grace: expect that large rocks trees m...,1
4,hostage,New ISIS Video: ISIS Threatens to Behead Croat...,1
...,...,...,...
6466,earthquake,#Earthquake #Sismo M 1.9 - 15km E of Anchorage...,1
6467,derail,@EmiiliexIrwin Totally agree.She is 23 and kno...,0
6468,trapped,Hollywood Movie About Trapped Miners Released ...,1
6469,weapons,Friendly reminder that the only country to eve...,1


In [None]:
# Create a balanced DataFrame (base rate = 0.5) for the single predictor, text: downsample class 0 to the size of class 1
df_new_blc = pd.concat([df_train_clean_copy[df_train_clean_copy["target"] == 0].sample(len(df_train_clean_copy[df_train_clean_copy["target"] == 1])), df_train_clean_copy[df_train_clean_copy["target"] == 1]], axis=0).reset_index()
df_new_blc

Unnamed: 0,index,keyword,text,target
0,4936,devastated,Foto: ÛÏLove isnÛªt everything to me anymore...,0
1,444,obliterated,Me- Don't bother calling or texting me because...,0
2,210,obliterated,Obliterated,0
3,1958,accident,???? it was an accident http://t.co/Oia5fxi4gM,0
4,4773,drowned,Given the US coalition's fondness for backing ...,0
...,...,...,...,...
5403,6465,rubble,China's Stock Market Crash: Are There Gems In ...,1
5404,6466,earthquake,#Earthquake #Sismo M 1.9 - 15km E of Anchorage...,1
5405,6468,trapped,Hollywood Movie About Trapped Miners Released ...,1
5406,6469,weapons,Friendly reminder that the only country to eve...,1


In [None]:
len(df_new_blc[df_new_blc['target']==0])

2704
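
The `.sample(...)` call above is not seeded, so each run of the notebook keeps a different subset of class 0 and the scores below are not exactly reproducible. A sketch of a seeded variant follows; `downsample_to_minority` is a hypothetical helper, and unlike the cell above it discards the old index instead of keeping it as an `index` column.

```python
import pandas as pd

def downsample_to_minority(frame, label_col="target", random_state=72):
    """Sample every class down to the size of the smallest one (hypothetical helper)."""
    smallest = frame[label_col].value_counts().min()
    parts = [
        group.sample(n=smallest, random_state=random_state)
        for _, group in frame.groupby(label_col)
    ]
    return pd.concat(parts, axis=0).reset_index(drop=True)

# Made-up imbalanced frame: five rows of class 0, two of class 1.
toy = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(7)],
    "target": [0, 0, 0, 0, 0, 1, 1],
})
balanced = downsample_to_minority(toy)
print(balanced["target"].value_counts().to_dict())
```

With a fixed `random_state`, calling the helper twice on the same frame returns the same rows.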

### Single Predictor balanced

In [None]:
X_train_blcd_p1 = df_new_blc['text'].values
Y_train_blcd_p1 = df_new_blc['target'].values
X_train_blcd_p1

array(['Foto: \x89ÛÏLove isn\x89Ûªt everything to me anymore. The last album I made [Red] was a devastated record because... http://t.co/T5agPS7T2B',
       "Me- Don't bother calling or texting me because my phone is obliterated\r\n\r\n*has 7k missed calls and messages*",
       'Obliterated', ...,
       "Hollywood Movie About Trapped Miners Released in Chile: 'The 33' Hollywood movie about trapped miners starring... http://t.co/tyyfG4qQvM",
       'Friendly reminder that the only country to ever use nuclear weapons is the U.S. And it was against civilians. https://t.co/7QrEPylLUK',
       'Buildings are on fire and they have time for a business meeting #TheStrain'],
      dtype=object)

In [None]:
Y_train_blcd_p1

array([0, 0, 0, ..., 1, 1, 1])

In [None]:
# # Define classifier
# classifier_2 = LogisticRegression(solver='saga', max_iter=1000, random_state = 72)

# # Create pipeline
# pipe_3 = Pipeline([('vectorizer', tfidf_vector_2),
#                  ('classifier', classifier_2)])

# Fit model on training set
pipe_3.fit(X_train_blcd_p1, Y_train_blcd_p1)

In [None]:
y_pred_blcd = pipe_3.predict(X_test)
y_pred_blcd

array([0, 0, 1, ..., 1, 0, 1])

In [None]:
# Evaluate on training set
print(f"ACCURACY SCORE:\n{pipe_3.score(X_train_blcd_p1, Y_train_blcd_p1):.6f}")

ACCURACY SCORE:
0.904401
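
Accuracy alone does not show how the errors are split between the two classes. The metrics imported at the top of the notebook (`confusion_matrix`, `precision_score`, `recall_score`, `f1_score`) can fill that gap. A minimal sketch on made-up labels and predictions, for illustration only:

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Made-up labels and predictions, for illustration only.
y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 1, 0, 0]

cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class
acc = accuracy_score(y_true, y_hat)
prec = precision_score(y_true, y_hat)
rec = recall_score(y_true, y_hat)
f1 = f1_score(y_true, y_hat)

print(cm)
print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f} f1={f1:.3f}")
```

In the notebook, `y_true` would be the labels of a held-out split and `y_hat` the pipeline's predictions for the same rows.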


In [None]:
# Submission

target_pred = pd.DataFrame(y_pred_blcd, columns= ['target'])

target_pred.to_csv (r'./submission6_4Dec.csv', index = False, header=True)

### Two Predictors: Text and Keyword

In [None]:
#we need a clean df with these two params

In [None]:
# Create a balanced DataFrame (base rate = 0.5) for two predictors, text and keyword
df_new_blc_2pdt = pd.concat([df_train_copy_1[df_train_copy_1["target"] == 0].sample(len(df_train_copy_1[df_train_copy_1["target"] == 1])), df_train_copy_1[df_train_copy_1["target"] == 1]], axis=0).reset_index()
df_new_blc_2pdt

Unnamed: 0,index,target,Text With Keyword
0,2204,0,desolation 'cause right now I can read too goo...
1,4830,0,screamed 26 people have screamed right in my e...
2,5280,0,hellfire Hellfire! We donÛªt even want to thi...
3,5559,0,collide Students COLLIDE this Fri/Sat - regist...
4,4155,0,landslide Now Playing: Landslide by Smashing P...
...,...,...,...
5403,6465,1,rubble China's Stock Market Crash: Are There G...
5404,6466,1,earthquake #Earthquake #Sismo M 1.9 - 15km E o...
5405,6468,1,trapped Hollywood Movie About Trapped Miners R...
5406,6469,1,weapons Friendly reminder that the only countr...


In [None]:
X_train_blcd_p2 = df_new_blc_2pdt['Text With Keyword'].values
Y_train_blcd_p2 = df_new_blc_2pdt['target'].values
X_train_blcd_p2

array(["desolation 'cause right now I can read too good don't send me no letters no. not unless you're gonna mail them from desolation row ~",
       'screamed 26 people have screamed right in my ear this week at camp. TWENTY-SIX! And we still have 1/2 day left. #CedarGlade2015',
       'hellfire Hellfire! We don\x89Ûªt even want to think about it or mention it so let\x89Ûªs not do anything that leads to it #islam!',
       ...,
       "trapped Hollywood Movie About Trapped Miners Released in Chile: 'The 33' Hollywood movie about trapped miners starring... http://t.co/tyyfG4qQvM",
       'weapons Friendly reminder that the only country to ever use nuclear weapons is the U.S. And it was against civilians. https://t.co/7QrEPylLUK',
       'buildings%20on%20fire Buildings are on fire and they have time for a business meeting #TheStrain'],
      dtype=object)

In [None]:

# Fit model on training set
pipe_3.fit(X_train_blcd_p2, Y_train_blcd_p2)

In [None]:
y_pred_blc_p2 = pipe_3.predict(X_test)
y_pred_blc_p2

array([0, 0, 1, ..., 1, 0, 1])

In [None]:
# Evaluate on training set
print(f"ACCURACY SCORE:\n{pipe_3.score(X_train_blcd_p2, Y_train_blcd_p2):.6f}")

ACCURACY SCORE:
0.886834


In [None]:
#submission

target_pred = pd.DataFrame(y_pred_blc_p2, columns= ['target'])

target_pred.to_csv (r'./submission7_4Dec.csv', index = False, header=True)

### Two Predictors: Text and Location

In [None]:
# Create a balanced DataFrame (base rate = 0.5) for two predictors, text and location
df_new_blc_2pd_loc = pd.concat([df_train_copy_3[df_train_copy_3["target"] == 0].sample(len(df_train_copy_3[df_train_copy_3["target"] == 1])), df_train_copy_3[df_train_copy_3["target"] == 1]], axis=0).reset_index()
df_new_blc_2pd_loc

Unnamed: 0,index,id,keyword,location,text,target,Text With location
0,5371,2670,crush,GLOBAL,Had a minute alone with my crush??...it was an...,0,GLOBAL Had a minute alone with my crush??...it...
1,1870,2579,crash,"Melbourne, Australia",@DestinyTheGame @Bungie @PlayStation Getting k...,0,"Melbourne, Australia @DestinyTheGame @Bungie..."
2,566,5449,first%20responders,"Washington, DC",Some good info to help first responders cope- ...,0,"Washington, DC Some good info to help first re..."
3,11,7725,panicking,UK,My dad is panicking as my weight loss means he...,0,UK My dad is panicking as my weight loss means...
4,2779,10114,upheaval,Jamaica,Series finale of #TheGame :( It survived so mu...,0,Jamaica Series finale of #TheGame :( It surviv...
...,...,...,...,...,...,...,...
3691,6455,3725,destroyed,"ÌÏT: 6.4682,3.18287",Flood: Two people dead 60 houses destroyed in ...,1,"ÌÏT: 6.4682,3.18287 Flood: Two people dead 60 ..."
3692,6459,4472,electrocuted,"South West, England",MT @Earths_Voice Treatment of #tigers in #Chin...,1,"South West, England MT @Earths_Voice Treatment..."
3693,6465,8330,rubble,"Calgary, AB, Canada",China's Stock Market Crash: Are There Gems In ...,1,"Calgary, AB, Canada China's Stock Market Crash..."
3694,6466,4377,earthquake,ARGENTINA,#Earthquake #Sismo M 1.9 - 15km E of Anchorage...,1,ARGENTINA #Earthquake #Sismo M 1.9 - 15km E of...


In [None]:
X_train_blcd_p2_loc = df_new_blc_2pd_loc['Text With location'].values
Y_train_blcd_p2_loc = df_new_blc_2pd_loc['target'].values
X_train_blcd_p2_loc

array(['GLOBAL Had a minute alone with my crush??...it was an overrated experience...smh',
       "  Melbourne, Australia @DestinyTheGame @Bungie @PlayStation Getting kicked out by that crash is one of the worst experiences I've had playing video games.",
       'Washington, DC Some good info to help first responders cope- Individual Resilience: Factsheet for Responders- http://t.co/FcFpijiqt5',
       ...,
       "Calgary, AB, Canada China's Stock Market Crash: Are There Gems In The Rubble? http://t.co/BqBLWiw08g #ROIMentor #yycwalks",
       'ARGENTINA #Earthquake #Sismo M 1.9 - 15km E of Anchorage Alaska: Time2015-08-06 00:11:16 UTC2015-08-05 16:11:16 -08:00 ... http://t.co/Z0VeR1hVM9',
       'Beirut/Toronto Friendly reminder that the only country to ever use nuclear weapons is the U.S. And it was against civilians. https://t.co/7QrEPylLUK'],
      dtype=object)

In [None]:

# Fit model on training set
pipe_3.fit(X_train_blcd_p2_loc, Y_train_blcd_p2_loc)

In [None]:
y_pred_blc_p2_loc = pipe_3.predict(X_test)
y_pred_blc_p2_loc

array([0, 0, 1, ..., 1, 0, 1])

In [None]:
# Evaluate on training set
print(f"ACCURACY SCORE:\n{pipe_3.score(X_train_blcd_p2_loc, Y_train_blcd_p2_loc):.6f}")

ACCURACY SCORE:
0.927219


In [None]:
#submission

target_pred = pd.DataFrame(y_pred_blc_p2_loc, columns= ['target'])

target_pred.to_csv (r'./submission8_4Dec.csv', index = False, header=True)

### Three Predictors

In [None]:
# Create a balanced DataFrame (base rate = 0.5) for three predictors: text, keyword and location
df_new_blc_3pd = pd.concat([df_train_copy_2[df_train_copy_2["target"] == 0].sample(len(df_train_copy_2[df_train_copy_2["target"] == 1])), df_train_copy_2[df_train_copy_2["target"] == 1]], axis=0).reset_index()
df_new_blc_3pd

Unnamed: 0,index,target,Text With Keyword and Location
0,1686,0,London destruction @Bonn1eGreer The Angel of H...
1,4546,0,Trinidad & Tobago crushed Disillusioned lead c...
2,5202,0,Right here damage @IndiGo6E But if you are car...
3,1997,0,Tn survived The 'twins' survived their first d...
4,5843,0,"Surry Hills, Sydney rescue Any lengths to resc..."
...,...,...,...
3691,6455,1,"ÌÏT: 6.4682,3.18287 destroyed Flood: Two peopl..."
3692,6459,1,"South West, England electrocuted MT @Earths_Vo..."
3693,6465,1,"Calgary, AB, Canada rubble China's Stock Marke..."
3694,6466,1,ARGENTINA earthquake #Earthquake #Sismo M 1.9 ...


In [None]:
X_train_blcd_p3 = df_new_blc_3pd['Text With Keyword and Location'].values
Y_train_blcd_p3 = df_new_blc_3pd['target'].values
X_train_blcd_p3

array(['London destruction @Bonn1eGreer The Angel of History propelled into the future by the winds of progress leaves in its wake piles of death and destruction. WB',
       'Trinidad & Tobago crushed Disillusioned lead character \r\nCheck\r\nHappy go lucky free spirit girl\r\nCheck\r\nDream life crushed\r\nCheck\r\nGreat music\r\nCheck\r\nAll Crowe tropes intact',
       "Right here damage @IndiGo6E But if you are carful about spotting damage @the time of check in why not @the time of giving away baggage?! It's my loss all d wy",
       ...,
       "Calgary, AB, Canada rubble China's Stock Market Crash: Are There Gems In The Rubble? http://t.co/BqBLWiw08g #ROIMentor #yycwalks",
       'ARGENTINA earthquake #Earthquake #Sismo M 1.9 - 15km E of Anchorage Alaska: Time2015-08-06 00:11:16 UTC2015-08-05 16:11:16 -08:00 ... http://t.co/Z0VeR1hVM9',
       'Beirut/Toronto weapons Friendly reminder that the only country to ever use nuclear weapons is the U.S. And it was against civilians. htt

In [None]:
# Fit model on training set
pipe_3.fit(X_train_blcd_p3, Y_train_blcd_p3)

In [None]:
y_pred_blc_p3 = pipe_3.predict(X_test)
y_pred_blc_p3

array([0, 0, 1, ..., 1, 0, 1])

In [None]:
# Evaluate on training set
print(f"ACCURACY SCORE:\n{pipe_3.score(X_train_blcd_p3, Y_train_blcd_p3):.6f}")

ACCURACY SCORE:
0.902056


In [None]:
#submission
target_pred = pd.DataFrame(y_pred_blc_p3, columns= ['target'])

target_pred.to_csv (r'./submission9_4Dec.csv', index = False, header=True)

### Broadening: upsampling the minority class

In [None]:
from sklearn.utils import resample


### Single Predictor

In [None]:
df_train_clean_copy[df_train_clean_copy['target'] == 0]

Unnamed: 0,keyword,text,target
0,destroyed,Black Eye 9: A space battle occurred at Star O...,0
1,bioterror,#world FedEx no longer to transport bioterror ...,0
6,trapped,Billionaires have a plan to free half a billio...,0
7,weapons,@JamesMelville Some old testimony of weapons u...,0
8,burning%20buildings,Ali you flew planes and ran into burning build...,0
...,...,...,...
6458,catastrophic,The best part of old baseball managers wearing...,0
6461,death,I feel like death,0
6462,epicentre,Epicentre - Cydia Tweak - https://t.co/WKmfDig...,0
6464,fear,Daily Reflections\r\nAugust 6\r\nDRIVEN\r\nDri...,0


In [None]:
df_majority = df_train_clean_copy[df_train_clean_copy['target'] == 0]
df_minority = df_train_clean_copy[df_train_clean_copy['target'] == 1]
 
# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=len(df_majority),  # match majority class size
                                 random_state=1234) # reproducible results
 
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])
 
# Display new class counts
df_upsampled.target.value_counts()


1    3665
0    3665
Name: target, dtype: int64
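The upsampling cell above is repeated for every feature variant with only the frame changing. A minimal sketch of the same step as a reusable helper follows, assuming a binary `target` column. The function name `upsample_minority` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.utils import resample


def upsample_minority(df, label_col="target", random_state=1234):
    """Resample the smaller class (with replacement) up to the size
    of the larger class and return the combined frame."""
    counts = df[label_col].value_counts()  # sorted, largest class first
    if len(counts) != 2:
        raise ValueError("expected exactly two classes")
    majority_label, minority_label = counts.index[0], counts.index[-1]

    df_majority = df[df[label_col] == majority_label]
    df_minority = df[df[label_col] == minority_label]

    df_minority_up = resample(
        df_minority,
        replace=True,                 # sample with replacement
        n_samples=len(df_majority),   # match the majority class size
        random_state=random_state,    # reproducible results
    )
    return pd.concat([df_majority, df_minority_up])


# Toy frame (hypothetical data): four negatives, two positives.
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e", "f"],
    "target": [0, 0, 0, 0, 1, 1],
})
balanced = upsample_minority(toy)
print(balanced["target"].value_counts())
```

Deriving `n_samples` from the data means the helper keeps working when rows are dropped, as happens in the location-based variants where the majority class is smaller.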

In [None]:
# Balanced dataframe (base rate = 0.5), single predictor: text
df_upsampled

Unnamed: 0,keyword,text,target
0,destroyed,Black Eye 9: A space battle occurred at Star O...,0
1,bioterror,#world FedEx no longer to transport bioterror ...,0
6,trapped,Billionaires have a plan to free half a billio...,0
7,weapons,@JamesMelville Some old testimony of weapons u...,0
8,burning%20buildings,Ali you flew planes and ran into burning build...,0
...,...,...,...
2671,blazing,Montgomery come for the blazing hot weather......,1
2451,outbreak,Families to sue over Legionnaires: More than 4...,1
6350,quarantined,Alabama firefighters quarantined after possibl...,1
865,natural%20disaster,Some people are really natural disaster too,1


In [None]:
X_train_blcd_p11 = df_upsampled['text'].values
Y_train_blcd_p11 = df_upsampled['target'].values
X_train_blcd_p11

array(['Black Eye 9: A space battle occurred at Star O784 involving 3 fleets totaling 3945 ships with 17 destroyed',
       '#world FedEx no longer to transport bioterror germs in wake of anthrax lab mishaps  http://t.co/5zDbTktwW7',
       'Billionaires have a plan to free half a billion dollars trapped in Venezuela for two years @BlakeSchmidt reports.\r\n\r\nhttp://t.co/gbqTc7Sp9C',
       ...,
       'Alabama firefighters quarantined after possible Ebola exposure http://t.co/hzpX6vAQPZ reports http://t.co/L4W0PCorbs',
       'Some people are really natural disaster too',
       '@unsuckdcmetro minor train leaves rails. Major = 1/13/82 Smithsonian Interlocking derails &amp; crashes into barrier wall w/ fatalities.'],
      dtype=object)

In [None]:
Y_train_blcd_p11

array([0, 0, 0, ..., 1, 1, 1])

In [None]:
pipe_3.fit(X_train_blcd_p11, Y_train_blcd_p11)

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_patt...b',
                                 tokenizer=<function spacy_tokenizer at 0x7f671c5e2488>,
                                 use_idf=True, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit

In [None]:
y_pred_33 = pipe_3.predict(X_test)
y_pred_33

array([0, 0, 1, ..., 1, 0, 1])

In [None]:
# Evaluate on training set
print(f"ACCURACY SCORE:\n{pipe_3.score(X_train_blcd_p11, Y_train_blcd_p11):.6f}")

ACCURACY SCORE:
0.925375
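The score above is measured on the same upsampled rows the model was fitted on, so it says little about unseen tweets. One possible way to get a held-out estimate is sketched below: split first, upsample only the training part, then score on the untouched validation rows. The toy data, the default `TfidfVectorizer` tokenizer (instead of the notebook's `spacy_tokenizer`) and all variable names are assumptions made for illustration.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.utils import resample

# Toy stand-in for the tweets (hypothetical data, imbalanced 12 vs 6).
toy = pd.DataFrame({
    "text": [f"nice calm day number {i}" for i in range(12)]
            + [f"flood fire disaster report {i}" for i in range(6)],
    "target": [0] * 12 + [1] * 6,
})

# 1. Split first, so no validation row can be duplicated into training.
train_df, val_df = train_test_split(
    toy, test_size=0.33, stratify=toy["target"], random_state=1234
)

# 2. Upsample the minority class inside the training part only.
train_major = train_df[train_df["target"] == 0]
train_minor = train_df[train_df["target"] == 1]
train_minor_up = resample(
    train_minor, replace=True, n_samples=len(train_major), random_state=1234
)
train_balanced = pd.concat([train_major, train_minor_up])

# 3. Fit on the balanced training rows, score on rows the model never saw.
pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])
pipe.fit(train_balanced["text"].values, train_balanced["target"].values)

val_pred = pipe.predict(val_df["text"].values)
val_acc = accuracy_score(val_df["target"], val_pred)
val_f1 = f1_score(val_df["target"], val_pred)
print(f"held-out accuracy: {val_acc:.3f}, F1: {val_f1:.3f}")
```

Upsampling before the split would let copies of the same minority row land on both sides, which inflates the validation score in the same way the training score is inflated.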


In [None]:
#Submission 2

target_pred = pd.DataFrame(y_pred_33, columns= ['target'])

target_pred.to_csv (r'./submission101_4Dec.csv', index = False, header=True)

### Two Predictors: Keyword

In [None]:
df_train_copy_1[df_train_copy_1['target'] == 0]

Unnamed: 0,target,Text With Keyword
0,0,destroyed Black Eye 9: A space battle occurred...
1,0,bioterror #world FedEx no longer to transport ...
6,0,trapped Billionaires have a plan to free half ...
7,0,weapons @JamesMelville Some old testimony of w...
8,0,burning%20buildings Ali you flew planes and ra...
...,...,...
6458,0,catastrophic The best part of old baseball man...
6461,0,death I feel like death
6462,0,epicentre Epicentre - Cydia Tweak - https://t....
6464,0,fear Daily Reflections\r\nAugust 6\r\nDRIVEN\r...


In [None]:
df_majority_p2 = df_train_copy_1[df_train_copy_1['target'] == 0]
df_minority_p2 = df_train_copy_1[df_train_copy_1['target'] == 1]
 
# Upsample minority class
df_minority_upsampled_1 = resample(df_minority_p2, 
                                 replace=True,     # sample with replacement
                                 n_samples=len(df_majority_p2),  # match majority class size
                                 random_state=1234) # reproducible results
 
# Combine majority class with upsampled minority class
df_upsampled_1 = pd.concat([df_majority_p2, df_minority_upsampled_1])
 
# Display new class counts
df_upsampled_1.target.value_counts()



1    3665
0    3665
Name: target, dtype: int64

In [None]:
X_train_upblcd_p2 = df_upsampled_1['Text With Keyword'].values
Y_train_upblcd_p2 = df_upsampled_1['target'].values
X_train_upblcd_p2

array(['destroyed Black Eye 9: A space battle occurred at Star O784 involving 3 fleets totaling 3945 ships with 17 destroyed',
       'bioterror #world FedEx no longer to transport bioterror germs in wake of anthrax lab mishaps  http://t.co/5zDbTktwW7',
       'trapped Billionaires have a plan to free half a billion dollars trapped in Venezuela for two years @BlakeSchmidt reports.\r\n\r\nhttp://t.co/gbqTc7Sp9C',
       ...,
       'quarantined Alabama firefighters quarantined after possible Ebola exposure http://t.co/hzpX6vAQPZ reports http://t.co/L4W0PCorbs',
       'natural%20disaster Some people are really natural disaster too',
       'fatalities @unsuckdcmetro minor train leaves rails. Major = 1/13/82 Smithsonian Interlocking derails &amp; crashes into barrier wall w/ fatalities.'],
      dtype=object)
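The `Text With Keyword` values above look like the keyword followed by a space and the tweet text, and some keywords still carry URL-encoding such as `natural%20disaster`. How the notebook built that column is not shown in this section, so the following is only a sketch under assumptions: the helper name `combine_columns`, the toy rows, the `%20` decoding and the choice to skip missing values are all hypothetical.

```python
from urllib.parse import unquote

import pandas as pd

# Toy rows (hypothetical data) with one missing keyword and one missing location.
toy = pd.DataFrame({
    "keyword": ["natural%20disaster", "fatalities", None],
    "location": ["USA", None, "UK"],
    "text": [
        "Some people are really natural disaster too",
        "minor train leaves rails",
        "this storm????",
    ],
})


def combine_columns(df, columns):
    """Join the given columns row by row with single spaces,
    skipping values that are missing or empty."""
    parts = df[columns].fillna("").astype(str)
    return parts.apply(lambda row: " ".join(v for v in row if v), axis=1)


# Decode URL-encoded keywords, e.g. 'natural%20disaster' becomes 'natural disaster'.
toy["keyword"] = toy["keyword"].map(
    lambda k: unquote(k) if isinstance(k, str) else k
)

toy["Text With Keyword"] = combine_columns(toy, ["keyword", "text"])
toy["Text With Keyword and Location"] = combine_columns(
    toy, ["location", "keyword", "text"]
)
print(toy["Text With Keyword"].tolist())
```

Decoding the keyword lets the vectorizer see `natural` and `disaster` as ordinary words rather than depending on how the tokenizer happens to split `%20`.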

In [None]:

# Fit model on training set
pipe_3.fit(X_train_upblcd_p2, Y_train_upblcd_p2)

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_patt...b',
                                 tokenizer=<function spacy_tokenizer at 0x7f671c5e2488>,
                                 use_idf=True, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit

In [None]:
y_pred_up_blc = pipe_3.predict(X_test)
y_pred_up_blc

array([0, 0, 1, ..., 1, 0, 1])

In [None]:
# Evaluate on training set
print(f"ACCURACY SCORE:\n{pipe_3.score(X_train_upblcd_p2, Y_train_upblcd_p2):.6f}")

ACCURACY SCORE:
0.908322


In [None]:
#Submission 

target_pred = pd.DataFrame(y_pred_up_blc, columns= ['target'])

target_pred.to_csv (r'./submission11_4Dec.csv', index = False, header=True)

### Two Predictors: Location

In [None]:
df_train_copy_3[df_train_copy_3['target'] == 0]

Unnamed: 0,id,keyword,location,text,target,Text With location
0,3738,destroyed,USA,Black Eye 9: A space battle occurred at Star O...,0,USA Black Eye 9: A space battle occurred at St...
6,9772,trapped,New York City,Billionaires have a plan to free half a billio...,0,New York City Billionaires have a plan to free...
7,10361,weapons,Multinational *****,@JamesMelville Some old testimony of weapons u...,0,Multinational ***** @JamesMelville Some old te...
8,1953,burning%20buildings,Los Angeles,Ali you flew planes and ran into burning build...,0,Los Angeles Ali you flew planes and ran into b...
11,7725,panicking,UK,My dad is panicking as my weight loss means he...,0,UK My dad is panicking as my weight loss means...
...,...,...,...,...,...,...
6450,8968,storm,"Johns Creek, GA",this storm????,0,"Johns Creek, GA this storm????"
6452,6410,hurricane,Somewhere Powerbraking A Chevy,@Freegeezy17 you stay in Houston?,0,Somewhere Powerbraking A Chevy @Freegeezy17 yo...
6457,3604,desolation,New York,The Hobbit Desolation of Smaug Thranduil 4' sc...,0,New York The Hobbit Desolation of Smaug Thrand...
6458,2176,catastrophic,"San Jose, CA",The best part of old baseball managers wearing...,0,"San Jose, CA The best part of old baseball man..."


In [None]:
df_majority_p2_loc = df_train_copy_3[df_train_copy_3['target'] == 0]
df_minority_p2_loc = df_train_copy_3[df_train_copy_3['target'] == 1]
 
# Upsample minority class
df_minority_upsampled_2 = resample(df_minority_p2_loc, 
                                 replace=True,     # sample with replacement
                                 n_samples=len(df_majority_p2_loc),  # match majority class size
                                 random_state=1234) # reproducible results
 
# Combine majority class with upsampled minority class
df_upsampled_2 = pd.concat([df_majority_p2_loc, df_minority_upsampled_2])
 
# Display new class counts
df_upsampled_2.target.value_counts()




1    2464
0    2464
Name: target, dtype: int64

In [None]:
X_train_upblcd_p2_loc = df_upsampled_2['Text With location'].values
Y_train_upblcd_p2_loc = df_upsampled_2['target'].values
X_train_upblcd_p2_loc

array(['USA Black Eye 9: A space battle occurred at Star O784 involving 3 fleets totaling 3945 ships with 17 destroyed',
       'New York City Billionaires have a plan to free half a billion dollars trapped in Venezuela for two years @BlakeSchmidt reports.\r\n\r\nhttp://t.co/gbqTc7Sp9C',
       'Multinational ***** @JamesMelville Some old testimony of weapons used to promote conflicts\r\nTactics - corruption &amp; infiltration of groups\r\nhttps://t.co/cyU8zxw1oH',
       ...,
       'Zimbabwe \x89Û÷Good Samaritans\x89Ûª shot in horror hijacking http://t.co/V5yUUALoqw #263Chat #Twimbos ZimpapersViews',
       "Dil's Campsite @Dead_Dreamer15 ...because if it were on fire that'd be a safety hazard",
       'oman muscat al seeb  I liked a @YouTube video http://t.co/43sXG9Z6xh TREMOR IS NO JOKE!! [TREMOR DLC] [FATALITIES/X-RAY]'],
      dtype=object)

In [None]:

# Fit model on training set
pipe_3.fit(X_train_upblcd_p2_loc, Y_train_upblcd_p2_loc)

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_patt...b',
                                 tokenizer=<function spacy_tokenizer at 0x7f671c5e2488>,
                                 use_idf=True, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit

In [None]:
y_pred_up_blc = pipe_3.predict(X_test)
y_pred_up_blc

array([0, 0, 1, ..., 1, 0, 1])

In [None]:
# Evaluate on training set
print(f"ACCURACY SCORE:\n{pipe_3.score(X_train_upblcd_p2_loc, Y_train_upblcd_p2_loc):.6f}")

ACCURACY SCORE:
0.939123


In [None]:
#Submission 

target_pred = pd.DataFrame(y_pred_up_blc, columns= ['target'])

target_pred.to_csv (r'./submission12_4Dec.csv', index = False, header=True)

### Three Predictors

In [None]:
df_train_copy_2[df_train_copy_2['target'] == 0]

Unnamed: 0,target,Text With Keyword and Location
0,0,USA destroyed Black Eye 9: A space battle occu...
6,0,New York City trapped Billionaires have a plan...
7,0,Multinational ***** weapons @JamesMelville Som...
8,0,Los Angeles burning%20buildings Ali you flew p...
11,0,UK panicking My dad is panicking as my weight ...
...,...,...
6450,0,"Johns Creek, GA storm this storm????"
6452,0,Somewhere Powerbraking A Chevy hurricane @Free...
6457,0,New York desolation The Hobbit Desolation of S...
6458,0,"San Jose, CA catastrophic The best part of old..."


In [None]:
df_majority_p3 = df_train_copy_2[df_train_copy_2['target'] == 0]
df_minority_p3 = df_train_copy_2[df_train_copy_2['target'] == 1]
 
# Upsample minority class
df_minority_upsampled_3 = resample(df_minority_p3, 
                                 replace=True,     # sample with replacement
                                 n_samples=len(df_majority_p3),  # match majority class size
                                 random_state=1234) # reproducible results
 
# Combine majority class with upsampled minority class
df_upsampled_3 = pd.concat([df_majority_p3, df_minority_upsampled_3])
 
# Display new class counts
df_upsampled_3.target.value_counts()


1    2464
0    2464
Name: target, dtype: int64

In [None]:
X_train_upblcd_p3 = df_upsampled_3['Text With Keyword and Location'].values
Y_train_upblcd_p3 = df_upsampled_3['target'].values
X_train_upblcd_p3

array(['USA destroyed Black Eye 9: A space battle occurred at Star O784 involving 3 fleets totaling 3945 ships with 17 destroyed',
       'New York City trapped Billionaires have a plan to free half a billion dollars trapped in Venezuela for two years @BlakeSchmidt reports.\r\n\r\nhttp://t.co/gbqTc7Sp9C',
       'Multinational ***** weapons @JamesMelville Some old testimony of weapons used to promote conflicts\r\nTactics - corruption &amp; infiltration of groups\r\nhttps://t.co/cyU8zxw1oH',
       ...,
       'Zimbabwe hijacking \x89Û÷Good Samaritans\x89Ûª shot in horror hijacking http://t.co/V5yUUALoqw #263Chat #Twimbos ZimpapersViews',
       "Dil's Campsite hazard @Dead_Dreamer15 ...because if it were on fire that'd be a safety hazard",
       'oman muscat al seeb  fatalities I liked a @YouTube video http://t.co/43sXG9Z6xh TREMOR IS NO JOKE!! [TREMOR DLC] [FATALITIES/X-RAY]'],
      dtype=object)

In [None]:

# Fit model on training set
pipe_3.fit(X_train_upblcd_p3, Y_train_upblcd_p3)

Pipeline(memory=None,
         steps=[('vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 1), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=None, strip_accents=None,
                                 sublinear_tf=False,
                                 token_patt...b',
                                 tokenizer=<function spacy_tokenizer at 0x7f671c5e2488>,
                                 use_idf=True, vocabulary=None)),
                ('classifier',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit

In [None]:
y_pred_up3_blc = pipe_3.predict(X_test)
y_pred_up3_blc

array([0, 0, 1, ..., 1, 0, 1])

In [None]:
# Evaluate on training set
print(f"ACCURACY SCORE:\n{pipe_3.score(X_train_upblcd_p3, Y_train_upblcd_p3):.6f}")

In [None]:
#Submission 

target_pred = pd.DataFrame(y_pred_up3_blc, columns= ['target'])



target_pred.to_csv (r'./submission13_4Dec.csv', index = False, header=True)
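The four experiments above refit the same `pipe_3` object and keep predictions in similarly named variables, which makes it easy to write the wrong array to a submission file. A possible refactoring is sketched below: loop over the feature columns, fit a fresh clone of the pipeline for each, and key the results by column name. The toy data, `base_pipe`, `X_test_toy` and the default tokenizer are assumptions for illustration, not the notebook's objects.

```python
import pandas as pd
from sklearn.base import clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Unfitted template pipeline (default tokenizer, unlike the notebook's spaCy one).
base_pipe = Pipeline([
    ("vectorizer", TfidfVectorizer()),
    ("classifier", LogisticRegression()),
])

# Toy training frame (hypothetical data) with two feature variants.
toy = pd.DataFrame({
    "text": ["flood in town", "lovely sunny day",
             "fire downtown", "great music tonight"],
    "Text With Keyword": ["flood flood in town", "sunny lovely sunny day",
                          "fire fire downtown", "music great music tonight"],
    "target": [1, 0, 1, 0],
})
X_test_toy = ["flood downtown", "sunny music day"]

submissions = {}
for column in ["text", "Text With Keyword"]:
    model = clone(base_pipe)  # fresh, unfitted copy for every variant
    model.fit(toy[column].values, toy["target"].values)
    y_pred = model.predict(X_test_toy)
    # Each frame could then be saved with .to_csv(<path>, index=False).
    submissions[column] = pd.DataFrame(y_pred, columns=["target"])

for column, frame in submissions.items():
    print(column, frame["target"].tolist())
```

Keeping each prediction next to the name of the column it came from removes the need for one hand-named variable per experiment.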