# This notebook is derived from the notebook below by Priyanka Sachdeva. Many thanks.
https://www.kaggle.com/priyankasachdeva20/autoviz-fake-news-classifier-6-ml-models

# We are going to use Auto_NLP to see whether automated ML techniques can achieve a better score than the manual tuning in the notebook above.
Auto_NLP can be found here: https://github.com/AutoViML/Auto_ViML

In [None]:
import pandas as pd
import numpy as np
import nltk 
import re
from pandas_profiling import ProfileReport
import matplotlib.pyplot as plt
import seaborn as sns 
import sklearn.metrics
import sklearn
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [None]:
data=pd.read_csv("../input/source-based-news-classification/news_articles.csv")
print(data.shape)
data.head()

# Since the dataset above has both text and tabular data, we will use Auto_ViML, which has Auto_NLP built in, to find the best model and the best NLP technique for this complex dataset.

In [None]:
!pip install autoviml

In [None]:
from autoviml.Auto_ViML import Auto_ViML

# Since there are duplicate columns, such as title and title_without_stopwords, we will remove the duplicates and combine all text columns into one column called "NLP_column" to make Auto_ViML processing easier.

In [None]:
print(data.shape)
data.drop(['title_without_stopwords', 'text_without_stopwords'], axis=1, inplace=True)
data['NLP_column'] = data['title'] + '  ' + data['text'] + '  ' + data['main_img_url']
data.drop(['title', 'text', 'main_img_url'], axis=1, inplace=True)
print(data.shape)
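
One thing worth keeping in mind about the concatenation above: in pandas, adding a string to a missing value yields a missing value, so a row with a missing title, text, or main_img_url ends up with a missing NLP_column. The toy frame below (made-up values, not the news data) is a small sketch of that behaviour, together with a `fillna("")` variant that would keep such rows if that were preferred. This notebook does not use the variant.

```python
import pandas as pd

# Toy data for illustration only; these values are not from news_articles.csv.
toy = pd.DataFrame({
    "title": ["Headline A", None],
    "text": ["Body A", "Body B"],
})

# Plain concatenation: any missing piece makes the whole result missing.
plain = toy["title"] + "  " + toy["text"]

# Alternative: replace missing pieces with empty strings before joining.
filled = toy["title"].fillna("") + "  " + toy["text"].fillna("")

print(plain.isna().tolist())
print(filled.tolist())
```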

In [None]:
data.head(1)

# We'll split the dataset in two, the same way as the notebook above, so we can see whether the AutoML library generalizes to unseen (test) data as well as manual training did.

In [None]:
### The label column has one NaN value, but about 50 rows contain a NaN in some column.
### Auto_ViML can handle NaN rows, but we delete them here to keep the comparison
### with the previous notebook apples to apples.
print(data.shape)
data = data.dropna()
print(data.shape)
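
The two shapes printed above show how many rows were lost, but not which columns were responsible. A per-column count of missing values is one way to check that before calling `dropna()`. The sketch below uses a toy frame with made-up values; on the real data the same two calls would be applied to `data`.

```python
import pandas as pd

# Toy data for illustration only; not the contents of news_articles.csv.
toy = pd.DataFrame({
    "label": ["Real", "Fake", None, "Fake"],
    "NLP_column": ["a", None, "c", "d"],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() removes every row that has at least one missing value.
cleaned = toy.dropna()
print(len(toy), "->", len(cleaned))
```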

In [None]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.2, random_state=42)

print(len(train), len(test))
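
A purely random split does not guarantee that train and test contain the same share of each label. A simple check is to compare `value_counts(normalize=True)` on both parts. The sketch below does this on toy data (made-up labels and proportions) and also passes `stratify`, an optional `train_test_split` argument that this notebook does not use, to show how the shares can be forced to match.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 80 "Fake" rows and 20 "Real" rows.
toy = pd.DataFrame({
    "NLP_column": [f"article {i}" for i in range(100)],
    "label": ["Fake"] * 80 + ["Real"] * 20,
})

# stratify keeps the label proportions equal in both parts of the split.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)

# Share of each label in the two parts.
train_share = toy_train["label"].value_counts(normalize=True)
test_share = toy_test["label"].value_counts(normalize=True)
print(train_share)
print(test_share)
```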

# Now let us run Auto_ViML (including Auto_NLP) on train and test. This may take about 10 minutes to run.

In [None]:
target = 'label'

In [None]:
m, feats, trainm, testm = Auto_ViML(train, target, test,
                                    sample_submission='',
                                    scoring_parameter='', KMeans_Featurizer=False,
                                    hyper_param='RS', feature_reduction=True,
                                    Boosting_Flag='CatBoost', Binning_Flag=False,
                                    Add_Poly=0, Stacking_Flag=False, Imbalanced_Flag=False,
                                    verbose=1)

Wow, that was 100% accuracy on the validation data! Now let's see how well Auto_ViML performs on unseen test data.

# Evaluating Auto_ViML on unseen test data

# Auto_ViML scores 100% on unseen test data!

In [None]:
from sklearn.metrics import classification_report, confusion_matrix

In [None]:
y_true = test[target].values
y_pred = testm[target+'_predictions'].values
print(y_true.shape, y_pred.shape)

In [None]:
print(classification_report(y_true, y_pred))

In [None]:
confusion_matrix(y_true, y_pred)
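
As a reading aid for the matrix above: scikit-learn puts the true labels on the rows and the predicted labels on the columns, so correct predictions sit on the diagonal and accuracy is the diagonal sum divided by the total. The sketch below illustrates this with made-up labels and predictions, not with the notebook's results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels and predictions for illustration only.
toy_true = ["Fake", "Fake", "Real", "Real", "Fake"]
toy_pred = ["Fake", "Real", "Real", "Real", "Fake"]

# Fix the label order so rows and columns are easy to interpret.
labels = ["Fake", "Real"]
cm = confusion_matrix(toy_true, toy_pred, labels=labels)

# Rows are true labels, columns are predicted labels.
print(cm)

# Accuracy is the diagonal (correct predictions) over all predictions.
accuracy_from_cm = cm.trace() / cm.sum()
print(accuracy_from_cm, accuracy_score(toy_true, toy_pred))
```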

So Auto_ViML has generalized well: it also performs well on the unseen test data.