# Project - Data Mining and Machine Learning
## Project description 
Real or Not? NLP with Disaster Tweets: the goal of this project is to build a machine learning model that predicts which tweets are about a real disaster and which are not. The project is based on a Kaggle competition.

## Team members
- Stéphane Vez
- Maël Maceiras
- Pierre Huber



## Libraries

In [1]:
# Import required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import folium
from sklearn.feature_extraction.text import CountVectorizer,TfidfVectorizer
from sklearn.base import TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.linear_model import LogisticRegression
import spacy
import string
from tqdm import tqdm
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score, f1_score



%matplotlib inline
sns.set_style("whitegrid")

In [2]:
!python -m spacy download en


Collecting en_core_web_sm==2.3.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.3.1/en_core_web_sm-2.3.1.tar.gz (12.0 MB)
     |████████████████████████████████| 12.0 MB 42.9 MB/s
Building wheels for collected packages: en-core-web-sm
  Building wheel for en-core-web-sm (setup.py) ... done
  Created wheel for en-core-web-sm: filename=en_core_web_sm-2.3.1-py3-none-any.whl size=12047106 sha256=fa24dd59039caf0d9402cf70d72075ff8801bf65553590edebf051e33a7fdbe8
  Stored in directory: /tmp/pip-ephem-wheel-cache-vzr6nld0/wheels/b7/0d/f0/7ecae8427c515065d75410989e15e5785dd3975fe06e795cd9
Successfully built en-core-web-sm
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-2.3.1
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/opt/venv/lib/python3.7/site-packages/en_core_web_sm -->
/opt/venv/lib/python3.7/sit

## Data


In [3]:
sample = pd.read_csv('https://raw.githubusercontent.com/PierreHuber/DMML_Apple/main/data/sample_submission.csv')
test = pd.read_csv("https://raw.githubusercontent.com/PierreHuber/DMML_Apple/main/data/test_data.csv")
test=test.set_index(test.id).drop(columns=['id'])
training = pd.read_csv('https://raw.githubusercontent.com/PierreHuber/DMML_Apple/main/data/training_data.csv')
training=training.set_index(training.id).drop(columns=['id'])


In [4]:
training.head(10)

id,keyword,location,text,target
3738,destroyed,USA,Black Eye 9: A space battle occurred at Star O...,0
853,bioterror,,#world FedEx no longer to transport bioterror ...,0
10540,windstorm,"Palm Beach County, FL",Reality Training: Train falls off elevated tra...,1
5988,hazardous,USA,#Taiwan Grace: expect that large rocks trees m...,1
6328,hostage,Australia,New ISIS Video: ISIS Threatens to Behead Croat...,1
6669,landslide,Scotland,FreeBesieged: .MartinMJ22 YouGov Which '#Tory ...,1
9772,trapped,New York City,Billionaires have a plan to free half a billio...,0
10361,weapons,Multinational *****,@JamesMelville Some old testimony of weapons u...,0
1953,burning%20buildings,Los Angeles,Ali you flew planes and ran into burning build...,0
9586,thunder,,The thunder shook my house woke my sister and ...,1
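
The `keyword` column appears to be URL-encoded: row 1953 above reads `burning%20buildings`. A minimal sketch of decoding such values with the standard library before using them as features; the helper name `decode_keyword` is hypothetical and not part of the original notebook.

```python
from urllib.parse import unquote

def decode_keyword(keyword):
    """Decode percent-escapes such as %20; pass missing values through unchanged."""
    if not isinstance(keyword, str):
        return keyword  # e.g. NaN for tweets without a keyword
    return unquote(keyword)

print(decode_keyword("burning%20buildings"))  # burning buildings
print(decode_keyword("windstorm"))            # windstorm
```

On the DataFrame this could be applied with `training.keyword.map(decode_keyword)`.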


In [5]:
# TODO: plot keywords (x) against target (y), with point size proportional to count


## Functions

In [6]:
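def spacy_tokenizer(sentence):
    # Relies on the globals `sp`, `stop_words` and `punctuations` defined in the "Sub 1" section
    mytokens = sp(sentence)
    # Lemmatize and lowercase; spaCy 2.x lemmatizes pronouns to "-PRON-", so keep their lowercased text instead
    mytokens = [word.lemma_.lower().strip() if word.lemma_ != "-PRON-" else word.lower_ for word in mytokens]
    # Drop stop words and punctuation
    mytokens = [word for word in mytokens if word not in stop_words and word not in punctuations]
    return mytokens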

In [7]:
def evaluate(true, pred):
    precision = precision_score(true, pred)
    recall = recall_score(true, pred)
    f1 = f1_score(true, pred)
    print(f"CONFUSION MATRIX:\n{confusion_matrix(true, pred)}")
    print(f"ACCURACY SCORE:\n{accuracy_score(true, pred):.4f}")
    print(f"CLASSIFICATION REPORT:\n\tPrecision: {precision:.4f}\n\tRecall: {recall:.4f}\n\tF1_Score: {f1:.4f}")
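
To make the numbers printed by `evaluate` concrete, the sketch below recomputes accuracy, precision, recall and F1 from raw counts in plain Python, so each definition is visible. The label vectors are made up for illustration and are not taken from the dataset; the helper name `binary_counts` is likewise an assumption.

```python
def binary_counts(true, pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(true, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(true, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(true, pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(true, pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

# Made-up labels, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 0]

tp, fp, fn, tn = binary_counts(y_true, y_pred)
precision = tp / (tp + fp)   # share of predicted disasters that are real
recall = tp / (tp + fn)      # share of real disasters that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)

# scikit-learn's confusion_matrix lays the counts out as [[tn, fp], [fn, tp]]
print([[tn, fp], [fn, tp]])
print(f"accuracy={accuracy:.4f} precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```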

## Base rate

In [8]:
print(training.target.value_counts())
print('Baserate: ', round(training.target.value_counts(normalize=True).max(), 2))


0    3701
1    2770
Name: target, dtype: int64
Baserate:  0.57
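
The base rate is the accuracy a classifier would reach by always predicting the majority class, so it is the floor any model has to beat. A small worked check of the figure above, using the class counts printed by the cell:

```python
# Class counts copied from the value_counts() output above
counts = {0: 3701, 1: 2770}

total = sum(counts.values())
majority_class = max(counts, key=counts.get)
base_rate = counts[majority_class] / total

print(majority_class, total, round(base_rate, 2))
```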


## Sub 1

In [9]:
# Text tokenizer: stop words, punctuation, lowercase, lemmatize

sp = spacy.load('en_core_web_sm')  # model name reported by the download step above
punctuations = string.punctuation
stop_words = spacy.lang.en.stop_words.STOP_WORDS

# Tokenize each tweet and store the tokens as one space-separated string
training['tokens'] = [" ".join(spacy_tokenizer(tweet)) for tweet in tqdm(training.text)]

100%|██████████| 6471/6471 [00:52<00:00, 122.55it/s]


In [10]:
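X_train = training.tokens  # already tokenized once; the pipeline's tokenizer runs on it again
y_train = training.target
X_test = test.text         # raw text, tokenized only inside the pipeline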

In [11]:
# TF-IDF + pipeline
tfidf_vector = TfidfVectorizer(tokenizer=spacy_tokenizer)

classifier = LogisticRegression(solver='lbfgs')  # , max_iter=1000, random_state=72)
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifier)])
pipe.fit(X_train, y_train)

#train accuracy:
print('Train accuracy: ', round(pipe.score(X_train, y_train),2))

#Sub1:
y_pred=pipe.predict(X_test)
sample.target=y_pred
sample.to_csv(r'sub1.csv', index = False)

Train accuracy:  0.89


## Sub 2

In [22]:
#LogCV
classifierCV = LogisticRegressionCV(solver='lbfgs', cv=5, max_iter=1000)
pipe = Pipeline([('vectorizer', tfidf_vector),
                 ('classifier', classifierCV)])
pipe.fit(X_train, y_train)

#train accuracy:
print('Train accuracy: ', round(pipe.score(X_train, y_train),2))

#Sub2:
y_pred=pipe.predict(X_test)
sample.target=y_pred
sample.to_csv(r'sub2.csv', index = False)

Train accuracy:  0.94
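
Both accuracies reported so far are measured on the same tweets the model was fitted on, so they are likely optimistic. One common remedy is to hold out part of the training data for validation. A minimal sketch of a reproducible hold-out split on row positions, in plain Python; the function name `holdout_indices`, the 20% fraction and the seed are assumptions for illustration, not part of the original notebook.

```python
import random

def holdout_indices(n_rows, val_fraction=0.2, seed=0):
    """Shuffle row positions and split them into (train, validation) lists."""
    positions = list(range(n_rows))
    random.Random(seed).shuffle(positions)  # local RNG keeps the split reproducible
    n_val = int(n_rows * val_fraction)
    return positions[n_val:], positions[:n_val]

train_idx, val_idx = holdout_indices(6471)  # 6471 = number of training tweets
print(len(train_idx), len(val_idx))
```

The pipeline could then be fitted on `X_train.iloc[train_idx]` and checked with `evaluate(y_train.iloc[val_idx], pipe.predict(X_train.iloc[val_idx]))`. scikit-learn's `train_test_split` covers the same idea and also supports stratified splits.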


## Cleaning

In [12]:
#keywords NaN

#location NaN

In [13]:
#location clean + encode
training.location.describe()

count     4330
unique    2921
top        USA
freq        91
Name: location, dtype: object
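
The summary shows 4330 non-missing locations out of 6471 tweets, with 2921 distinct values, so the field is both sparse and free-form. A hedged sketch of a light normalization step that could precede encoding; the helper `normalize_location` and the `"unknown"` placeholder are assumptions, not the notebook's method.

```python
def normalize_location(value, placeholder="unknown"):
    """Lowercase and collapse whitespace; map missing or empty values to a placeholder."""
    if not isinstance(value, str):
        return placeholder  # NaN and None are not strings
    cleaned = " ".join(value.split()).casefold()
    return cleaned if cleaned else placeholder

# Share of missing locations, from the counts above
n_tweets, n_with_location = 6471, 4330
missing_share = (n_tweets - n_with_location) / n_tweets

print(normalize_location("  New York  City "))
print(normalize_location(float("nan")))
print(round(missing_share, 2))
```

Applied with `training.location.map(normalize_location)`, this would merge variants that differ only in case or spacing; it does not resolve synonyms such as city versus country names.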

In [14]:
#logregCV

In [15]:
#improve text preparation: configs()

In [16]:
#resampling: base_rate=0.5
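
The resampling cell above is only a placeholder. One possible way to reach a base rate of 0.5 is random undersampling of the majority class; the sketch below is an assumption about what that step could look like, in plain Python, with a hypothetical helper name `undersample_indices`. The label list is rebuilt from the class counts printed in the base rate section.

```python
import random

def undersample_indices(labels, seed=0):
    """Return sorted row positions with every class cut down to the minority class size."""
    rng = random.Random(seed)
    by_class = {}
    for pos, label in enumerate(labels):
        by_class.setdefault(label, []).append(pos)
    n_min = min(len(rows) for rows in by_class.values())
    kept = []
    for rows in by_class.values():
        kept.extend(rng.sample(rows, n_min))  # sample without replacement
    return sorted(kept)

# Stand-in labels with the same class counts as the training set
labels = [0] * 3701 + [1] * 2770
kept = undersample_indices(labels)
balanced = [labels[i] for i in kept]

print(len(kept), balanced.count(0), balanced.count(1))
```

With a DataFrame, `training.iloc[kept]` would select the balanced subset. Undersampling discards 931 majority-class tweets here, which is a trade-off to weigh against oversampling or class weights.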

In [17]:
#randomForestClassifier

In [18]:
#dependency parsing: noun_chunks

In [19]:
#entity detection

In [20]:
#Word embedding
#Word2Vec
#GloVe

In [21]:
#KNN + Decision Tree