In [None]:
__author__ = "Pierre-Habté Nouvellon"
__email__ = "pierrehabte@snipfeed.co"
__status__ = "Challenge NLP"

# Intro:

One of the big challenge at Snipfeed is the NLP part. In fact, we have to develop a system that is capable of: 
- (i) recommending articles/videos to people based on their interaction
- (ii) giving context to an article (that is give content related to an article to allow the user to go deeper into a subject)
- (iii) giving the most relevant content following query (search engine).
- (iv) avoiding fake news
- (v) extracting the most important information from an article to transform it into a snip


The understanding and implementation of classic machine learning / deeplearning techniques is a requirement. This test is made to see the candidate's ability to learn new technics and apply it on a concrete examples.

# Fake news challenge

In this task you will have to build a ML model that gives a confidence score between 1 and 0 to an article (is this article reliable or not). The requirements are:

- train you model on the Kaggle fake news dataset https://www.kaggle.com/c/fake-news/data .
- provide the metrics to assess you model on the testing set (accuracy etc)
- build a final function give_score(link) that takes a link of an article as an input and outputs the confidence score. 
- compare 3 different ML technics. Parameter tuning is very important, so explain each step of the process in the notebook.

In [1]:
import pandas as pd
import numpy as np

train_set = pd.read_csv("/data/train.csv", index_col=0)
train_set = train_set.fillna(" ")

train_set.head()

Unnamed: 0_level_0,title,author,text,label
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


In [3]:
train_set["total"] = train_set["title"] + " " + train_set["author"] + train_set["text"]

In [4]:
"""Preping the data"""
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.model_selection import cross_val_score

In [6]:
count_vectorizer = CountVectorizer(ngram_range=(1,2))
transformer = TfidfTransformer()
counts = count_vectorizer.fit_transform(train_set["total"].values)
tfidf = transformer.fit_transform(counts)

In [7]:
targets = train_set["label"].values
features = tfidf

In [11]:
from sklearn.ensemble import RandomForestClassifier
random_forest_classifier = RandomForestClassifier(n_estimators=10, n_jobs=-1)
#random_forest_classifier.fit(features, targets)
accuracies = cross_val_score(estimator=random_forest_classifier, X=features, y=targets, cv=10)
mean_accuracy = accuracies.mean()
print(f"Cross validation mean accuracy: {mean_accuracy}")

Cross validation mean accuracy: 0.8577894118980136


In [12]:
from sklearn.ensemble import AdaBoostClassifier
ada_boost_classifier = AdaBoostClassifier()
accuracies = cross_val_score(estimator=ada_boost_classifier, X=features, y=targets, cv=10)
mean_accuracy = accuracies.mean()
print(f"Cross validation mean accuracy: {mean_accuracy}")

Cross validation mean accuracy: 0.9728357258408916


In [15]:
from sklearn.ensemble import GradientBoostingClassifier
grad_boost_classifier = GradientBoostingClassifier(n_estimators=10)
accuracies = cross_val_score(estimator=grad_boost_classifier, X=features, y=targets, cv=10)
mean_accuracy = accuracies.mean()
print(f"Cross validation mean accuracy: {mean_accuracy}")

Cross validation mean accuracy: 0.943028376030238


# Technical challenge : Imitation of a function using a Neural network


In this challenge you will have to create a Recurrent Neural Networks (RNN) that takes as an input a vector $X=[x_1,x_2,..,x_{M_X}]$ of size $(1,M_X)$ (where $M_X$ can be of any size between 1 and 20 and each component $x_i$ between -10 and 10) and gives as an output the y= $\sum_{i=1}^{M_X}|x_i| $. In other words, the network will have to emulate the l1 norm.

For instance, if I give the vector $X_{example1}=[ -4;-2;3]$ as an input to the network, it has to output $4+2+3 =9$. 

If I give the vector $X_{example2} = [ -4; -2 ; 3 ; 10 ; -5]$ as an input to the network, it has to output $4+2+3+10+5 =24$

For this challenge you can use all library and tools that you want (Keras, tensorflow, pytorch..). Please put your code in this jupyter notebook. I didn't provide a training/testing set because you will have to create it yourself !

Here is a good ressource if you don't know anything about RNN: 

- https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
- https://www.cpuheater.com/deep-learning/introduction-to-recurrent-neural-networks-in-pytorch/
- https://medium.com/dair-ai/building-rnns-is-fun-with-pytorch-and-google-colab-3903ea9a3a79

In [1]:
from dataset import VectDataset, PadCollate
from model import RNNRegressor
import torch.nn as nn
import torch.utils.data as data
from train import train

model = RNNRegressor(input_size=1, hidden_size=30, num_layers=3)
train_dataset = VectDataset(dataset_size=10000)
train_dataloader = data.DataLoader(train_dataset, batch_size=1000, collate_fn=PadCollate(dim=0))
train(model, dataloader=train_dataloader, epochs=600, criterion=nn.MSELoss())


CUDA
[Epoch 0, Batch 0/10]:  [Loss: 3912.04]


  epoch, ind_batch, len(dataloader), loss.data[0]


[Epoch 1, Batch 0/10]:  [Loss: 3909.72]
[Epoch 2, Batch 0/10]:  [Loss: 3907.26]
[Epoch 3, Batch 0/10]:  [Loss: 3904.62]
[Epoch 4, Batch 0/10]:  [Loss: 3901.79]
[Epoch 5, Batch 0/10]:  [Loss: 3898.40]
[Epoch 6, Batch 0/10]:  [Loss: 3893.88]
[Epoch 7, Batch 0/10]:  [Loss: 3887.57]
[Epoch 8, Batch 0/10]:  [Loss: 3879.18]
[Epoch 9, Batch 0/10]:  [Loss: 3867.71]
[Epoch 10, Batch 0/10]:  [Loss: 3851.27]
[Epoch 11, Batch 0/10]:  [Loss: 3825.51]
[Epoch 12, Batch 0/10]:  [Loss: 3783.84]
[Epoch 13, Batch 0/10]:  [Loss: 3713.51]
[Epoch 14, Batch 0/10]:  [Loss: 3589.43]
[Epoch 15, Batch 0/10]:  [Loss: 3361.85]
[Epoch 16, Batch 0/10]:  [Loss: 2935.22]
[Epoch 17, Batch 0/10]:  [Loss: 2144.58]
[Epoch 18, Batch 0/10]:  [Loss: 937.15]
[Epoch 19, Batch 0/10]:  [Loss: 628.21]
[Epoch 20, Batch 0/10]:  [Loss: 485.81]
[Epoch 21, Batch 0/10]:  [Loss: 479.23]
[Epoch 22, Batch 0/10]:  [Loss: 440.58]
[Epoch 23, Batch 0/10]:  [Loss: 428.33]
[Epoch 24, Batch 0/10]:  [Loss: 408.23]
[Epoch 25, Batch 0/10]:  [Loss: 

[Epoch 209, Batch 0/10]:  [Loss: 4.54]
[Epoch 210, Batch 0/10]:  [Loss: 4.50]
[Epoch 211, Batch 0/10]:  [Loss: 4.45]
[Epoch 212, Batch 0/10]:  [Loss: 4.41]
[Epoch 213, Batch 0/10]:  [Loss: 4.37]
[Epoch 214, Batch 0/10]:  [Loss: 4.33]
[Epoch 215, Batch 0/10]:  [Loss: 4.29]
[Epoch 216, Batch 0/10]:  [Loss: 4.25]
[Epoch 217, Batch 0/10]:  [Loss: 4.21]
[Epoch 218, Batch 0/10]:  [Loss: 4.17]
[Epoch 219, Batch 0/10]:  [Loss: 4.13]
[Epoch 220, Batch 0/10]:  [Loss: 4.09]
[Epoch 221, Batch 0/10]:  [Loss: 4.05]
[Epoch 222, Batch 0/10]:  [Loss: 4.02]
[Epoch 223, Batch 0/10]:  [Loss: 3.98]
[Epoch 224, Batch 0/10]:  [Loss: 3.94]
[Epoch 225, Batch 0/10]:  [Loss: 3.91]
[Epoch 226, Batch 0/10]:  [Loss: 3.87]
[Epoch 227, Batch 0/10]:  [Loss: 3.84]
[Epoch 228, Batch 0/10]:  [Loss: 3.80]
[Epoch 229, Batch 0/10]:  [Loss: 3.77]
[Epoch 230, Batch 0/10]:  [Loss: 3.73]
[Epoch 231, Batch 0/10]:  [Loss: 3.70]
[Epoch 232, Batch 0/10]:  [Loss: 3.67]
[Epoch 233, Batch 0/10]:  [Loss: 3.63]
[Epoch 234, Batch 0/10]: 

[Epoch 421, Batch 0/10]:  [Loss: 0.79]
[Epoch 422, Batch 0/10]:  [Loss: 0.78]
[Epoch 423, Batch 0/10]:  [Loss: 0.78]
[Epoch 424, Batch 0/10]:  [Loss: 0.77]
[Epoch 425, Batch 0/10]:  [Loss: 0.77]
[Epoch 426, Batch 0/10]:  [Loss: 0.76]
[Epoch 427, Batch 0/10]:  [Loss: 0.76]
[Epoch 428, Batch 0/10]:  [Loss: 0.75]
[Epoch 429, Batch 0/10]:  [Loss: 0.75]
[Epoch 430, Batch 0/10]:  [Loss: 0.74]
[Epoch 431, Batch 0/10]:  [Loss: 0.73]
[Epoch 432, Batch 0/10]:  [Loss: 0.73]
[Epoch 433, Batch 0/10]:  [Loss: 0.72]
[Epoch 434, Batch 0/10]:  [Loss: 0.72]
[Epoch 435, Batch 0/10]:  [Loss: 0.71]
[Epoch 436, Batch 0/10]:  [Loss: 0.71]
[Epoch 437, Batch 0/10]:  [Loss: 0.70]
[Epoch 438, Batch 0/10]:  [Loss: 0.70]
[Epoch 439, Batch 0/10]:  [Loss: 0.69]
[Epoch 440, Batch 0/10]:  [Loss: 0.69]
[Epoch 441, Batch 0/10]:  [Loss: 0.68]
[Epoch 442, Batch 0/10]:  [Loss: 0.68]
[Epoch 443, Batch 0/10]:  [Loss: 0.68]
[Epoch 444, Batch 0/10]:  [Loss: 0.67]
[Epoch 445, Batch 0/10]:  [Loss: 0.67]
[Epoch 446, Batch 0/10]: 

In [3]:
from test import test

test_dataset = VectDataset(dataset_size=2000)
test_dataloader = data.DataLoader(test_dataset, batch_size=500, collate_fn=PadCollate(dim=0))
test(model, dataloader=test_dataloader, criterion=nn.MSELoss())


Running on cuda
cuda
MSELoss test loss = 0.217


6
