# NLP Project Kaggle


# Project description

Competition Description

Twitter has become an important communication channel in times of emergency.
The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. Because of this, more agencies are interested in programmatically monitoring Twitter (e.g. disaster relief organizations and news agencies).

But, it’s not always clear whether a person’s words are actually announcing a disaster. Take this example:





The author explicitly uses the word “ABLAZE” but means it metaphorically. This is clear to a human right away, especially with the visual aid. But it’s less clear to a machine.

In this competition, you’re challenged to build a machine learning model that predicts which Tweets are about real disasters and which ones aren’t. You’ll have access to a dataset of 10,000 tweets that were hand classified. If this is your first time working on an NLP problem, we've created a quick tutorial to get you up and running.

Disclaimer: The dataset for this competition contains text that may be considered profane, vulgar, or offensive.

Acknowledgments

This dataset was created by the company figure-eight and originally shared on their ‘Data For Everyone’ website here.

Tweet source: https://twitter.com/AnyOtherAnnaK/status/629195955506708480

In [2]:
# import libraries
import pandas as pd 
import numpy as np 
import seaborn as sns 
import matplotlib.pyplot  as plt 
%matplotlib inline

In [3]:
# Read the train and test data
train = pd.read_csv("./train.csv")
test = pd.read_csv("./test.csv")
print(train.shape, test.shape) #, sub_sample.shape)

(7613, 5) (3263, 4)


In [4]:
train.head()

Unnamed: 0,id,keyword,location,text,target
0,1,,,Our Deeds are the Reason of this #earthquake M...,1
1,4,,,Forest fire near La Ronge Sask. Canada,1
2,5,,,All residents asked to 'shelter in place' are ...,1
3,6,,,"13,000 people receive #wildfires evacuation or...",1
4,7,,,Just got sent this photo from Ruby #Alaska as ...,1


In [5]:
test.head()

Unnamed: 0,id,keyword,location,text
0,0,,,Just happened a terrible car crash
1,2,,,"Heard about #earthquake is different cities, s..."
2,3,,,"there is a forest fire at spot pond, geese are..."
3,9,,,Apocalypse lighting. #Spokane #wildfires
4,11,,,Typhoon Soudelor kills 28 in China and Taiwan


In [6]:
print(train.shape, test.shape)

(7613, 5) (3263, 4)


### Take only the two columns relevant for the model here: the target and text columns

In [7]:
# Keep the test ids for the submission file before dropping the other columns
test_id = test['id']

# Select only the columns that are relevant for us, i.e. text and target
train = train[['text', 'target']]
test = test[['text']]
print(test.shape, train.shape)

(3263, 1) (7613, 2)


In [8]:
test = test[['text']]
test.head()

Unnamed: 0,text
0,Just happened a terrible car crash
1,"Heard about #earthquake is different cities, s..."
2,"there is a forest fire at spot pond, geese are..."
3,Apocalypse lighting. #Spokane #wildfires
4,Typhoon Soudelor kills 28 in China and Taiwan


In [11]:
# test.isnull().sum()
train.isnull().sum()

# we have no null values on test and train data

text      0
target    0
dtype: int64

# EDA 

* Lowercase the text
* Remove HTML
* Remove special characters
* Remove numbers
* Split into words (tokenize the text)
* Lemmatization or stemming


In [101]:
%%writefile Def_Clean_text_NLP.py

# Load the necessary libraries
import re
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Build these once at import time instead of once per token
STOP_WORDS = set(stopwords.words('english'))
TOKENIZER = RegexpTokenizer(r'\w+')
LEMMATIZER = WordNetLemmatizer()
HTML_TAG = re.compile(r'<.*?>')


# Function that cleans one piece of text
def text_cleaning(text):
    # Lowercase the text
    text = str(text).lower()

    # Remove the '{html}' placeholder and any HTML tags
    text = text.replace('{html}', ' ')
    text = HTML_TAG.sub('', text)

    # Remove numbers
    text = re.sub(r'[0-9]+', '', text)

    # Tokenize the text (keeping only word characters also drops punctuation)
    tokens = TOKENIZER.tokenize(text)

    # Remove stop words and tokens of one or two characters
    filtered_words = [w for w in tokens if len(w) > 2 and w not in STOP_WORDS]

    # Lemmatization
    lemma_words = [LEMMATIZER.lemmatize(w) for w in filtered_words]

    # Join all the words back into one cleaned text and return it
    return ' '.join(lemma_words)
    

Overwriting Def_Clean_text_NLP.py


In [29]:
# Combine the train and test data so that both can be cleaned in one pass

frame = [train, test]
# Note: `keys` only labels the outer index level of each frame
# (train -> 'text', test -> 'target'); these labels are not column names.
data = pd.concat(frame, keys=['text', 'target'])

print(data.shape, train.shape, test.shape)

(10876, 2) (7613, 2) (3263, 1)




  


In [111]:
# We can now clean the whole data set using the function written to Def_Clean_text_NLP.py above
from Def_Clean_text_NLP import text_cleaning

data['cleaned_text'] = data['text'].map(lambda s: text_cleaning(str(s)))
data.head()


Unnamed: 0,Unnamed: 1,target,text,cleaned_text
text,0,1.0,Our Deeds are the Reason of this #earthquake M...,deed reason earthquake may allah forgive
text,1,1.0,Forest fire near La Ronge Sask. Canada,forest fire near ronge sask canada
text,2,1.0,All residents asked to 'shelter in place' are ...,resident asked shelter place notified officer ...
text,3,1.0,"13,000 people receive #wildfires evacuation or...",people receive wildfire evacuation order calif...
text,4,1.0,Just got sent this photo from Ruby #Alaska as ...,got sent photo ruby alaska smoke wildfire pour...


In [117]:
data.target.unique()

array([ 1.,  0., nan])

# Convert the text into vectors to build the model with 

## Vectorization 

* Use a bag-of-words representation here to convert each text into a vector of word counts
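To make the bag-of-words idea concrete, here is a minimal pure-Python sketch of what a count vectorizer does conceptually. The helper names `fit_vocabulary` and `transform` and the toy sentences are hypothetical, and splitting on whitespace is a simplification; this is not how `CountVectorizer` is implemented, only an illustration of the same idea (learn a vocabulary of the most frequent words, then count them per text).

```python
from collections import Counter


def fit_vocabulary(docs, max_features=None):
    """Learn a word -> column index mapping from whitespace-tokenized texts."""
    totals = Counter(word for doc in docs for word in doc.split())
    # Keep the most frequent words, then order the columns alphabetically
    kept = [word for word, _ in totals.most_common(max_features)]
    return {word: idx for idx, word in enumerate(sorted(kept))}


def transform(docs, vocabulary):
    """Turn each text into a list of counts, one entry per vocabulary word."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        # Words that are not in the vocabulary are simply ignored
        rows.append([counts.get(word, 0) for word in vocabulary])
    return rows


train_docs = ["forest fire near canada",
              "fire fire evacuation order",
              "forest evacuation"]
test_docs = ["earthquake near forest fire"]

# The vocabulary is learned from the training texts only ...
vocab = fit_vocabulary(train_docs, max_features=3)
X_train_toy = transform(train_docs, vocab)
# ... and reused unchanged for the test texts, so the columns line up
X_test_toy = transform(test_docs, vocab)

print(vocab)
print(X_train_toy)
print(X_test_toy)
```

The point of the design is that the vocabulary is fitted once and then reused, so every text, seen or unseen, maps to a vector with the same columns.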

In [119]:
# load the appropriate library 

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(max_features= 1000)
features = vectorizer.fit_transform(data['cleaned_text'])
features = features.toarray()

features  # for all data in general

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [120]:
print(features.shape)

(10876, 1000)


(7613, 1000)

## Split back the data to the trained and test data cleaned


In [131]:
# Split back the data to the trained and test data cleaned

d = data.drop('text', axis = 'columns')
train_cleaned = d[:len(train)]
# print(train.shape, train_cleaned.shape)
test_cleaned = d[len(train):].drop('target', axis ='columns')
# print(test.shape, test_cleaned.shape)


In [132]:
# we can save the test_cleaned and train_cleaned data into pickle so that we do not have to clean up the data later on.

import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

joblib.dump(test_cleaned, 'test_cleaned.pkl')
joblib.dump(train_cleaned, 'train_cleaned.pkl')


# # load back the pickle data saved.
# data = joblib.load('train_cleaned.pkl')




['train_cleaned.pkl']

In [159]:
# Split the training data to build the model

from sklearn.model_selection import train_test_split  # split data


# Convert the cleaned train and test text into count vectors

# Fit the vocabulary on the cleaned training text
vectorizer = CountVectorizer(max_features=1000)
features_train = vectorizer.fit_transform(train_cleaned['cleaned_text'])
x_train = features_train.toarray()

# Reuse the fitted vocabulary for the test text: calling fit_transform again
# would learn a different vocabulary and misalign the columns
features_test = vectorizer.transform(test_cleaned['cleaned_text'])
X_test = features_test.toarray()

# Hold out 30% of the training data for validation
y_train = train_cleaned['target']

X_train, X_val, Y_train, Y_val = train_test_split(x_train, y_train, test_size=.30, random_state=0)

print(X_train.shape, X_val.shape, X_test.shape)


(5329, 1000) (2284, 1000) (3263, 1000)


# Build the model 

In [178]:
### %%writefile Build_Pipeline.py

# Import the libraries needed to build the classification models


# to make the pipelines
from sklearn.pipeline  import make_pipeline

# classifications models
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression


# cross validation to validate the model
from sklearn.model_selection import cross_val_score

# To scale the data before fitting the model
from sklearn.preprocessing import MinMaxScaler  # scales to [0, 1]; used in non-statistical contexts
from sklearn.preprocessing import StandardScaler  # returns z-scores


# Build one pipeline per model

pipe_rfc = make_pipeline(StandardScaler(),
                        RandomForestClassifier(criterion='gini',
                                               min_impurity_decrease= 0.0,
                                               n_estimators = 100,
                                               random_state = None
                                              )
                        )

pipe_lr = make_pipeline(StandardScaler(),
                       LogisticRegression(random_state = 2))

pipe_svc = make_pipeline(StandardScaler(), 
                        SVC(kernel= 'rbf', C =30, gamma ='auto' )
                        )

pipe_ada = make_pipeline(StandardScaler(),
                        AdaBoostClassifier(n_estimators=100, 
                                           random_state=0
                                          ))

pipe_bag = make_pipeline(StandardScaler(),
                        BaggingClassifier())

pipe_gboost = make_pipeline(StandardScaler(),
                           GradientBoostingClassifier())

pipe_knn = make_pipeline(StandardScaler(),
                        KNeighborsClassifier())


# Collect all the pipelines in one list
pipelines = [pipe_lr, 
             pipe_rfc, 
             pipe_svc, 
             pipe_ada, 
             pipe_bag, 
             pipe_gboost, 
             pipe_knn]


# Dictionary of pipeline indices and classifier names, for ease of reference
pip_dict ={0:'Logistic Regression', 
           1:'Random Forest Classifier',
           2: 'Support Vector Machine Classifier',
           3: 'AdaBoost Classifier',
           4: 'Bagging Classifier',
           5: 'Gradient boost Classifier',
           6: 'KNeighbors Classifier'
          }


# Fit all the pipelines
for pipe in  pipelines:
    pipe.fit(X_train, Y_train)

# Print the train and validation accuracy of every pipeline to see which one performs best

for i, model in enumerate(pipelines):
    print("{} Train accuracy: {} and Validation accuracy: {}".format(pip_dict[i], model.score(X_train, Y_train), model.score(X_val, Y_val)))



STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


Logistic Regression Train accuracy: 0.8695815349971853 and Validation accuracy: 0.781523642732049
Random Forest Classifier Train accuracy: 0.9769187464815162 and Validation accuracy: 0.792031523642732
Support Vector Machine Classifier Train accuracy: 0.9733533495965472 and Validation accuracy: 0.7727670753064798
AdaBoost Classifier Train accuracy: 0.7969600300243949 and Validation accuracy: 0.7797723292469352
Bagging Classifier Train accuracy: 0.9605929817977107 and Validation accuracy: 0.760507880910683
Gradient boost Classifier Train accuracy: 0.7828860949521487 and Validation accuracy: 0.76138353765324
KNeighbors Classifier Train accuracy: 0.8055920435353725 and Validation accuracy: 0.7377408056042032


# Tune the model

* Random forest overfits the data.
* AdaBoost does not perform very well, but it does not overfit the data.
* Gradient boosting also does not perform very well, but it does not overfit either.

* We can run a grid search over the hyperparameters of the random forest and the SVM to see whether the model improves without overfitting the data too much. If that does not work, we can try to improve a model that does not overfit, such as AdaBoost.


We can also try to optimise AdaBoost and/or use ensemble techniques, such as combining several models, to improve the model performance.
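The overfitting remarks above can be checked against the accuracies printed earlier. The sketch below hard-codes those numbers (rounded to four decimals) and computes the gap between training and validation accuracy for each model. The short model names and the idea of using the gap as a rough overfitting indicator are choices made for this illustration.

```python
# (train accuracy, validation accuracy), rounded from the output printed earlier
scores = {
    "Logistic Regression": (0.8696, 0.7815),
    "Random Forest": (0.9769, 0.7920),
    "SVM": (0.9734, 0.7728),
    "AdaBoost": (0.7970, 0.7798),
    "Bagging": (0.9606, 0.7605),
    "Gradient Boosting": (0.7829, 0.7614),
    "KNeighbors": (0.8056, 0.7377),
}

# A large gap between train and validation accuracy suggests overfitting
gaps = {name: train - val for name, (train, val) in scores.items()}

for name, gap in sorted(gaps.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:20s} gap = {gap:.3f}  validation = {scores[name][1]:.3f}")

least_overfit = min(gaps, key=gaps.get)
best_validation = max(scores, key=lambda name: scores[name][1])
print("Smallest gap:", least_overfit)
print("Best validation accuracy:", best_validation)
```

Read this way, the random forest has the best validation accuracy but a large gap, while AdaBoost has the smallest gap, which is why both are candidates for tuning.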

#### Tune the best model to improve its accuracy
#### We can use grid search


* An important hyperparameter for AdaBoost algorithm is the number of decision trees used in the ensemble.

* Recall that each decision tree used in the ensemble is designed to be a weak learner. That is, it has skill over random prediction, but is not highly skillful. As such, one-level decision trees are used, called decision stumps.

* The number of trees added to the model must be high for the model to work well, often hundreds, if not thousands.
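To illustrate the points above, here is a small, self-contained sketch of classic (discrete) AdaBoost with decision stumps on a toy one-dimensional data set. It is an illustration only: the function names and the toy data are made up for this example, labels are coded as +1/-1, and scikit-learn's `AdaBoostClassifier` is not implemented this way line for line.

```python
import math


def stump_predict(x, threshold, polarity):
    """Decision stump: predict `polarity` if x > threshold, else `-polarity`."""
    return polarity if x > threshold else -polarity


def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) of the best stump."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best


def adaboost(xs, ys, n_estimators):
    """Fit `n_estimators` weighted stumps; returns a list of (alpha, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(n_estimators):
        error, threshold, polarity = fit_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - error) / error)  # vote weight of this stump
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified samples, decrease the others
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1


def accuracy(ensemble, xs, ys):
    return sum(predict(ensemble, x) == y for x, y in zip(xs, ys)) / len(xs)


# Toy data that no single threshold can separate
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

acc_1 = accuracy(adaboost(xs, ys, 1), xs, ys)
acc_3 = accuracy(adaboost(xs, ys, 3), xs, ys)
print("1 stump :", acc_1)
print("3 stumps:", acc_3)
```

A single stump is only a weak learner on this data, while three boosted stumps fit it perfectly. On real, noisy data such as the tweets, far more estimators are typically needed, which is why `n_estimators` is the first hyperparameter to tune.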


In [None]:
## Compare model performance and choose the best model to use.

# from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV


# Candidate models and the hyperparameter values to search for each of them

model_params = {
    'svm': {
        'model': SVC(gamma='auto'),
        'params': {
            'C': [1, 10, 20],
            'kernel': ['rbf', 'linear']
        }
    },
    'random_forest': {
        'model': RandomForestClassifier(),
        'params': {
            'n_estimators': [1, 5, 10]
        }
    },
    'logistic_regression': {
        'model': LogisticRegression(solver='liblinear'),
        'params': {
            'C': [1, 5, 10]
        }
    }
}


scores = []
for model_name, mp in model_params.items():
    clf = RandomizedSearchCV(mp['model'],
                             mp['params'],
                             cv=5,  # number of cross-validation folds
                             return_train_score=False,
                             n_iter=2)  # randomized search only; a full grid search takes too much time
    clf.fit(X_train, Y_train)

    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
d = pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])
d


In [165]:
Ada = AdaBoostClassifier()
Ada.get_params()

{'algorithm': 'SAMME.R',
 'base_estimator': None,
 'learning_rate': 1.0,
 'n_estimators': 50,
 'random_state': None}

In [168]:
RFC = RandomForestClassifier()
RFC.get_params()  # get all the hyper parameters for this model


{'bootstrap': True,
 'ccp_alpha': 0.0,
 'class_weight': None,
 'criterion': 'gini',
 'max_depth': None,
 'max_features': 'auto',
 'max_leaf_nodes': None,
 'max_samples': None,
 'min_impurity_decrease': 0.0,
 'min_impurity_split': None,
 'min_samples_leaf': 1,
 'min_samples_split': 2,
 'min_weight_fraction_leaf': 0.0,
 'n_estimators': 100,
 'n_jobs': None,
 'oob_score': False,
 'random_state': None,
 'verbose': 0,
 'warm_start': False}

In [169]:
# Tune hyper parameters for the best performed model Svc 

svc = SVC()
svc.get_params()  # get all the hyper parameters for this model


{'C': 1.0,
 'break_ties': False,
 'cache_size': 200,
 'class_weight': None,
 'coef0': 0.0,
 'decision_function_shape': 'ovr',
 'degree': 3,
 'gamma': 'scale',
 'kernel': 'rbf',
 'max_iter': -1,
 'probability': False,
 'random_state': None,
 'shrinking': True,
 'tol': 0.001,
 'verbose': False}


In [None]:
# Run a grid search for this model to pick better hyperparameters and improve its performance.
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV


# SVC
svc_clf = GridSearchCV(SVC(gamma='auto'),
                       {'C': [1, 10, 20],
                        'kernel': ['rbf', 'linear', 'poly']
                       },
                       cv=5,
                       return_train_score=False)
svc_clf.fit(X_train, Y_train)

# Export the cross-validation results to a pandas DataFrame
df_svc_tuning = pd.DataFrame(svc_clf.cv_results_)
df_svc_tuning

In [77]:

# specify parameters and distributions to sample from
param_grid_svc = {'svc__C': [0.001, 0.01, 0.1, 1, 10],
                  'svc__gamma': [0.001, 0.01, 0.1, 1], 
                  'svc__kernel':['rbf','poly']
                 }

pipe_svc = make_pipeline(MinMaxScaler(), (SVC())) 

## grid = GridSearchCV(pipe_svc , param_grid = param_grid_svc, cv = 5) 

samples = 10
randomcv = RandomizedSearchCV(pipe_svc, param_distributions =param_grid_svc, n_iter=samples, cv =5)
randomcv.fit( X_train, Y_train) 

print(" Best cross-validation accuracy: {:.2f}". format( randomcv.best_score_)) 
print(" Best parameters: ", randomcv.best_params_) 
print(" Validation set accuracy: {:.2f}". format( randomcv.score( X_val, Y_val)))

 

 Best cross-validation accuracy: 0.72
 Best parameters:  {'svc__kernel': 'rbf', 'svc__gamma': 1, 'svc__C': 10}
 Validation set accuracy: 0.75


The score went down instead of improving. Why? One possible reason is that the randomized search tried only 10 of the 40 combinations in the grid, so it may have missed better settings.
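To see how little of the search space a randomized search with 10 samples covers here, the sketch below enumerates the SVC grid used above and draws 10 combinations from it. It mimics the sampling idea only, using the standard library; the variable names and the fixed seed are choices made for this illustration, and no model is fitted.

```python
import itertools
import random

# Same values as param_grid_svc above, without the 'svc__' pipeline prefix
param_grid = {
    "C": [0.001, 0.01, 0.1, 1, 10],
    "gamma": [0.001, 0.01, 0.1, 1],
    "kernel": ["rbf", "poly"],
}

# Every combination a full grid search would evaluate
all_combos = [dict(zip(param_grid, values))
              for values in itertools.product(*param_grid.values())]

# A randomized search only evaluates n_iter of them
n_iter = 10
rng = random.Random(0)  # fixed seed so that the sketch is reproducible
sampled = rng.sample(all_combos, n_iter)  # sampling without replacement

coverage = n_iter / len(all_combos)
print("Grid size:", len(all_combos))
print("Sampled:", len(sampled))
print("Share of the grid explored: {:.0%}".format(coverage))
```

With only a quarter of the grid explored, the best combination is more likely to be missed than found, so a lower score after "tuning" is not surprising. Raising `n_iter` or narrowing the grid are the usual remedies.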

In [78]:
best_random = randomcv.best_estimator_
best_random.score(X_val , Y_val)

0.7530647985989493

In [86]:
# use those hyperparameters to build the model
from sklearn.svm import SVC 

pipe_svc2 = make_pipeline(MinMaxScaler(), SVC(kernel= 'rbf', gamma= 1, C= 10))
pipe_svc2.fit(X_train, Y_train)
print(pipe_svc2.score(X_train, Y_train), pipe_svc2.score(X_val, Y_val))

# Oops, this overfits: the tuned model scores about 0.97 on the training set but only about 0.75 on the validation set.

0.9748545693375867 0.7530647985989493


### Tune another model to try to improve the score


In [93]:
# RandomForestClassifier tuning
rfc = RandomForestClassifier() # Look at parameters used by our current forest
print('Parameters currently in use:\n')
print(rfc.get_params())


Parameters currently in use:

{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}


In [101]:
# define grid of hyper parameters
# specify parameters and distributions to sample from
from scipy.stats import randint as sp_randint

rfcl = RandomForestClassifier(n_estimators=50)
param_rfcl = {"max_depth": [3, None],
              "max_features": sp_randint(1, 11),
              "min_samples_split": sp_randint(2, 11),
              "min_samples_leaf": sp_randint(1, 11),
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# run randomized search
samples = 10  # number of random samples
randomCV = RandomizedSearchCV(rfcl, param_distributions=param_rfcl, n_iter=samples, cv=5)

randomCV.fit(X_train, Y_train)

# evaluate the model
print(" Best estimator:", randomCV.best_estimator_)
print(" Best cross-validation accuracy: {:.2f}". format( randomCV.best_score_)) 
print(" Best parameters: ", randomCV.best_params_) 
print(" Test set accuracy train: {:.2f}". format(randomCV.score( X_train, Y_train)))
print(" Test set accuracy val: {:.2f}". format(randomCV.score( X_val, Y_val)))


 Best estimator: RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features=3,
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=2, min_samples_split=6,
                       min_weight_fraction_leaf=0.0, n_estimators=50,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)
 Best cross-validation accuracy: 0.78
 Best parameters:  {'bootstrap': True, 'criterion': 'gini', 'max_depth': None, 'max_features': 3, 'min_samples_leaf': 2, 'min_samples_split': 6}
 Test set accuracy train: 0.82
 Test set accuracy val: 0.80


In [108]:
# OK, now we do not see overfitting, and the performance has improved a little bit
# Use those best parameters to build the final model

pipe_RF = make_pipeline(MinMaxScaler(), RandomForestClassifier(n_estimators=100,
                                                                criterion ='gini', 
                                                               max_depth = None, 
                                                               max_features = 3,
                                                              min_samples_leaf = 2, 
                                                               min_samples_split = 6)
                       )

# fit the pipeline
pipe_RF.fit(X_train, Y_train)

# evaluate the model
print("{} Test accuracy: {} {}".format(pipe_RF, pipe_RF.score(X_train, Y_train), pipe_RF.score(X_val, Y_val)))


Pipeline(memory=None,
         steps=[('minmaxscaler', MinMaxScaler(copy=True, feature_range=(0, 1))),
                ('randomforestclassifier',
                 RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                        class_weight=None, criterion='gini',
                                        max_depth=None, max_features=3,
                                        max_leaf_nodes=None, max_samples=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=2, min_samples_split=6,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=100, n_jobs=None,
                                        oob_score=False, random_state=None,
                                        verbose=0, warm_start=False))],
         verbose=False) Test accuracy: 0.819290673672

In [None]:
#  Deploy this and see the result

# y_pred = pipe_RF.predict(X_test)

In [109]:
y_predict = pipe_RF.predict(X_test).astype(int)  # predictions are already class labels (0/1); np.expm1 only makes sense for log-transformed regression targets


In [128]:
y_predict = [int(i) for i in y_predict]  # convert all into int type


In [129]:
# create the submission file

sub = pd.DataFrame()
sub['id'] = test_id
sub['target'] = y_predict
sub.to_csv('submission1.csv',index=False)

#### This submission gives me a score of 0.78 on Kaggle. Hmm, that is not good at all.


In [None]:
# Hint: tune all the models and combine them to get one strong predictor
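One simple way to combine several tuned models is a majority vote over their predicted labels. The sketch below shows the idea in plain Python. The function name `majority_vote` and the three prediction lists are made up for this illustration and are not output from the models above. scikit-learn provides the same idea as `VotingClassifier` with `voting='hard'`.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model label lists by taking the most common label per sample."""
    combined = []
    for labels in zip(*predictions_per_model):
        combined.append(Counter(labels).most_common(1)[0][0])
    return combined


# Hypothetical 0/1 predictions of three models for five tweets
preds_rf = [1, 0, 1, 1, 0]
preds_svc = [1, 1, 0, 1, 0]
preds_ada = [0, 0, 1, 1, 1]

ensemble_preds = majority_vote([preds_rf, preds_svc, preds_ada])
print(ensemble_preds)
```

Using an odd number of models avoids ties for a binary target. The vote helps most when the individual models make different mistakes.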



In [273]:
# choose one model to submit with
model = pipe_RF  # for example
y_predict = model.predict(X_test).astype(int)  # predictions are already class labels (0/1)


sub = pd.DataFrame()
sub['id'] = test_id
sub['target'] = y_predict
# name the file after the last pipeline step instead of the full model repr
sub.to_csv('submission_' + model.steps[-1][0] + '.csv', index=False)

In [None]:
# Deal with missing values, look at skewness.
# Now split the data, not with the split function but simply in the same order you combined them.
# Split into train and test data sets.


# At the end, save the result in two columns
# y_predict = np.floor(np.expm1(xgb_model.predict(x_test)))

# sub = pd.DataFrame()
# sub['Id'] = test_id
# sub['SalePrice'] = y_predict
# sub.to_csv('submission2_xgboost.csv',index=False)

# or else this
# result = model.predict(test_data_features)

# output = pd.DataFrame(data={"id": test["id"], "sentiment": result})
# write the file
# output.to_csv("name_submission_file.csv", index=False, quoting=3)

 A lot of missing values 
 
 visualize missing values using ggplot2
 
 https://www.kaggle.com/elenapetrova/joyful-analysis-of-death-causes/data