# Internet News Prediction & EDA
Welcome to today's notebook, where we will visualise a dataset of newspaper articles and their details, then predict each article's source from its content.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from wordcloud import WordCloud
from xgboost import XGBClassifier
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


# Preparing the data
The first step is to drop the useless 'Unnamed: 0' column and fill the null values: text columns get the placeholder string 'NaN' and engagement counts get 0.

In [None]:
df = pd.read_csv('../input/internet-articles-data-with-users-engagement/articles_data.csv')
df = df.drop('Unnamed: 0', axis=1)

df['title'] = df['title'].fillna('NaN')
df['description'] = df['description'].fillna('NaN')
df['content'] = df['content'].fillna('NaN')
df['published_at'] = df['published_at'].fillna('NaN')

df['engagement_reaction_count'] = df['engagement_reaction_count'].fillna(0)
df['engagement_comment_count'] = df['engagement_comment_count'].fillna(0)
df['engagement_share_count'] = df['engagement_share_count'].fillna(0)
df['engagement_comment_plugin_count'] = df['engagement_comment_plugin_count'].fillna(0)

In [None]:
df.head()

# Visualising the data
Next, we will perform EDA on our features.

The following cell defines a procedure that plots a bar chart showing the distribution of a variable.

In [None]:
def bar_charts(title, x, y, colour, values, keys, figsize=(10, 5), fontsize=12):
    fig, ax = plt.subplots(1, 1, figsize=figsize)
    bars = plt.bar(keys, values, color=colour)

    # Label each bar with its own value instead of reading the global `count`
    for bar, label in zip(bars, values):
        height = bar.get_height()
        plt.text(bar.get_x() + bar.get_width()/2, height, label, ha='center', va='bottom',
                 fontsize=fontsize)

    plt.title(title, fontsize=fontsize)
    plt.xlabel(x, fontsize=fontsize)
    plt.ylabel(y, fontsize=fontsize)
    plt.show()

Next we define a second procedure, this time for line graphs.

In [None]:
def plots(title, x, y, values, keys, figsize=(10, 5), fontsize=12):
    fig, ax = plt.subplots(1, 1, figsize=figsize)
    plt.plot(list(keys), list(values))
    plt.title(title)
    plt.xlabel(x)
    plt.ylabel(y)
    plt.show()

A very useful technique for visualisation is the WordCloud, which shows what words are the most frequently occurring.

In [None]:
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 20))

for ax in [[ax1, 'title'], [ax2, 'description'], [ax3, 'content']]:
    wordcloud = WordCloud(background_color='white').generate(' '.join(df[ax[1]]))
    ax[0].set_title(ax[1], fontsize=20)
    ax[0].imshow(wordcloud)
    ax[0].axis('off')
    
plt.show()

## Bar charts

The first variable that we will display is the 'source_name' column, which names the publisher of each article. As seen below, 'Reuters', 'BBC News' and 'Irish Times' have written the most articles.

In [None]:
count = Counter(df['source_name'])
count = pd.Series(count).sort_values(ascending=False)

keys = list(count.keys())
keys[keys.index('The New York Times')] =  'The NY Times'
keys[keys.index('Al Jazeera English')] = 'Al Jazeera'
keys[keys.index('The Wall Street Journal')] =  'Wall Street Journal'

bar_charts('Articles per source', 'Source name', 'Number of articles', 'blue', count, keys,
          (20, 13), 18)
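
The counting step above can also be done with `Counter.most_common`, which returns `(value, count)` pairs already sorted by descending count. A minimal sketch, using a made-up list of source names rather than the real column:

```python
from collections import Counter

# Hypothetical sample standing in for df['source_name']
sources = ['Reuters', 'BBC News', 'Reuters', 'CNN', 'BBC News', 'Reuters']

# most_common() sorts by count, highest first
ranked = Counter(sources).most_common()

# Split the pairs into the two sequences a bar chart needs
keys = [name for name, _ in ranked]
values = [n for _, n in ranked]

print(keys)
print(values)
```

Passing a number, as in `most_common(2)`, keeps only the top entries, which plays the same role as slicing the sorted Series.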

Next, we show how many articles were written in September and October. October's total was less than a third of September's.

In [None]:
month = [i[5:7] for i in df['published_at']]
count = Counter(month)
count = pd.Series(count).sort_values(ascending=False)[:2]

bar_charts('Distribution of articles released per month', 'Month number', 'Number of articles',
          'orange', count, count.keys(), figsize=(15, 10))

The next chart shows how many pieces were written per day, looking only at September. The busiest single day was the third of September.

In [None]:
day = [i[8:10] for i in df['published_at']]
count = Counter(day)
count = pd.Series(count).sort_values(ascending=False)[:13]

bar_charts('Day that articles were released', 'Day released', 'Number of articles', 'purple', 
           count, count.keys(), figsize=(15, 10))

Now we use a bar chart to see which hours are the most popular for releasing a news piece. Around 2-4 pm appears to be the most common time to publish.

In [None]:
hour = [i[11:13] for i in df['published_at']]
count = Counter(hour)
count = pd.Series(count).sort_values(ascending=False)[:20]

bar_charts('Hours that articles were released', 'Hour released', 'Number of articles',
          'green', count, count.keys(), figsize=(13, 10))
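
The three cells above pull the month, day and hour out of 'published_at' by slicing fixed character positions. The sketch below compares that with explicit parsing, assuming the timestamps follow an ISO 8601 layout like the made-up example shown; the real column may differ.

```python
from datetime import datetime

# Hypothetical timestamp in the layout the slices above imply
stamp = '2019-09-03T14:05:00Z'

# Positional slicing, as used in the cells above
month, day, hour = stamp[5:7], stamp[8:10], stamp[11:13]

# Explicit parsing raises ValueError on malformed input instead of
# silently returning the wrong characters
parsed = datetime.strptime(stamp, '%Y-%m-%dT%H:%M:%SZ')

print(month, day, hour)
print(parsed.month, parsed.day, parsed.hour)
```

Slicing is quick but silent about bad input: rows whose 'published_at' was filled with the string 'NaN' yield empty strings for all three slices, whereas parsing would fail loudly on them.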

## Line graphs

We now switch our attention to line graphs, starting with how many articles were released on each day.

In [None]:
day_and_month = pd.DataFrame([])
day_and_month['day'] = day
day_and_month['month'] = month

count1 = Counter(day_and_month[day_and_month['month']=='09']['day'])
count2 = Counter(day_and_month[day_and_month['month']=='10']['day'])

keys = pd.concat([pd.Series(count1.keys()), pd.Series(count2.keys())[:2]])
values = pd.concat([pd.Series(count1.values()), pd.Series(count2.values())[:2]])
count = dict(zip(keys, values))

plots('Articles released over the days', 'Days', 'Number of articles', count.values(),
      count.keys(), (13, 8))

The next four line graphs show the reaction, comment, share and comment-plugin engagement counts.

Each count is plotted against the row index of the DataFrame, which the 'Time' axis label stands in for. The many spikes show that the counts follow no steady trend and that the jumps are seemingly unpredictable.

In [None]:
plots('Engagement reaction count over time', 'Time', 'Engagement reaction count', 
      df['engagement_reaction_count'], df['engagement_reaction_count'].keys(), (13, 8))

In [None]:
plots('Engagement comment count over time', 'Time', 'Engagement comment count', 
      df['engagement_comment_count'], df['engagement_comment_count'].keys(), (13, 8))

In [None]:
plots('Engagement share count over time', 'Time', 'Engagement share count', 
      df['engagement_share_count'], df['engagement_share_count'].keys(), (13, 8))

In [None]:
plots('Engagement comment plugin count over time', 'Time', 'Engagement comment plugin count', 
      df['engagement_comment_plugin_count'], df['engagement_comment_plugin_count'].keys(), 
      (13, 8))

## Correlation

Next, we use a heatmap to check whether any of the variables are correlated. Three sets of features show a clear correlation, meaning they tend to rise and fall together; this alone does not show that one causes the other.

In [None]:
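# Restrict to numeric columns; recent pandas versions raise an error on text columns
sns.heatmap(df.select_dtypes('number').corr(), annot=True)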
plt.title('Correlation of variables')
plt.show()
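
Each cell of the heatmap is a Pearson correlation coefficient, which is the default method of `DataFrame.corr`. As a rough illustration of what that number measures, here is a hand-rolled version applied to made-up engagement figures (the helper `pearson` and the sample lists are hypothetical, not part of the notebook):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the means
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up engagement numbers, not taken from the dataset
reactions = [10, 20, 30, 40, 50]
shares = [1, 2, 3, 4, 5]
comments = [5, 4, 3, 2, 1]

print(pearson(reactions, shares))
print(pearson(reactions, comments))
```

Values near 1 mean the two columns rise together, values near -1 mean one falls as the other rises, and values near 0 mean there is no linear relationship.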

Lastly, we will scatter the datapoints and create a line of best fit to have a closer look at how the different variables correlate to each other.

We first mask some outliers in two of the columns because they could skew the fitted lines.

In [None]:
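# Values at or above each threshold become NaN, so regplot ignores them; no rows are dropped
df['engagement_reaction_count'] = df['engagement_reaction_count'].where(df['engagement_reaction_count'] < 100000)
df['engagement_share_count'] = df['engagement_share_count'].where(df['engagement_share_count'] < 20000)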

In [None]:
fontsize=15
fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(20, 7))

for ax in [[ax1, ['engagement_reaction_count', 'engagement_comment_count']],
           [ax2, ['engagement_reaction_count', 'engagement_share_count']],
           [ax3, ['engagement_share_count', 'engagement_comment_count']]]:
    sns.regplot(data=df, x=ax[1][0], y=ax[1][1], ax=ax[0])
    ax[0].set_xlabel(ax[1][0], fontsize=fontsize)
    ax[0].set_ylabel(ax[1][1], fontsize=fontsize)

ax2.set_title('Correlation of variables', fontsize=30, pad=30)
plt.show()

# Predicting the data

## Splitting our dataset

We assign the 'content' feature to 'X' and the 'source_name' feature to 'y'. Both are then split into train and test sets, with 20% of the rows held out for testing.

In [None]:
X = df['content']
y = df['source_name']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

## Using NLP

We must first convert the text into numerical form before a predictor can use it. This is done with a 'CountVectorizer', which counts how often each word occurs in each article, followed by a 'TFIDF' transform, which down-weights words that appear in many articles.

In [None]:
cv = CountVectorizer()
tfidf = TfidfTransformer()

X_train = cv.fit_transform(X_train)
X_test = cv.transform(X_test)

X_train = tfidf.fit_transform(X_train)
X_test = tfidf.transform(X_test)
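
To make the two steps concrete, the sketch below reproduces them by hand on two made-up, pre-tokenised documents. It assumes the default settings of `TfidfTransformer` (smoothed IDF and L2 normalisation of each row) and skips the tokenising and lowercasing that `CountVectorizer` does; names such as `tfidf_row` are hypothetical.

```python
import math
from collections import Counter

# Two made-up documents, already lowercased and split into tokens
docs = [['markets', 'rise', 'markets'], ['markets', 'fall']]
vocab = sorted({word for doc in docs for word in doc})
n_docs = len(docs)

# Document frequency: in how many documents each word appears
df_counts = Counter(word for doc in docs for word in set(doc))

# Smoothed IDF: ln((1 + n) / (1 + df)) + 1
idf = {w: math.log((1 + n_docs) / (1 + df_counts[w])) + 1 for w in vocab}

def tfidf_row(doc):
    """Raw count times IDF for each vocabulary word, scaled to unit length."""
    counts = Counter(doc)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

rows = [tfidf_row(doc) for doc in docs]
for row in rows:
    print([round(v, 3) for v in row])
```

In the second document 'fall' ends up with a higher weight than 'markets' even though both occur once, because 'markets' appears in every document and so carries less information.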

## Using TruncatedSVD

Now we reduce the dimensionality of the data with a TruncatedSVD model. It is similar to PCA but does not centre the data first, so it works directly on sparse text matrices.

In [None]:
svd = TruncatedSVD(n_components=2000)
X_train = svd.fit_transform(X_train)
X_test = svd.transform(X_test)
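
The idea behind the reduction can be sketched with NumPy's exact SVD on a small random matrix. This is only an illustration of the underlying maths: `TruncatedSVD` uses its own solvers and never forms the full decomposition, and the matrix below is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up 6x5 matrix standing in for a small TF-IDF matrix
X = rng.random((6, 5))

# Full SVD: X = U @ diag(s) @ Vt, singular values sorted largest first
U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of components to keep
X_reduced = U[:, :k] * s[:k]      # 6x2 representation of the rows
X_approx = X_reduced @ Vt[:k, :]  # rank-k reconstruction, back to 6x5

print(X_reduced.shape)
print(np.linalg.norm(X - X_approx))
```

The reduced matrix equals `X @ Vt[:k].T`, that is, each row projected onto the top `k` right singular vectors, and the reconstruction error is exactly the energy in the discarded singular values.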

## Creating and evaluating classifiers

Next, we train three different classifiers, 'SGD', 'Random Forest' and 'Linear SVC', and then evaluate their performance.

In [None]:
classifiers = [['SGD', SGDClassifier()], ['Random Forest', RandomForestClassifier()],
              ['Linear SVC', LinearSVC()]]
scores = []
cross_vals = []

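for name, model in classifiers:
    model.fit(X_train, y_train)

    # Accuracy of the fitted model on the held-out test set
    score = model.score(X_test, y_test)
    # Note: cross_val_score clones the estimator and refits it on folds of the
    # data it is given, so this is cross-validated accuracy within the test set only
    cross_val = cross_val_score(model, X_test, y_test).mean()
    scores.append(score)
    cross_vals.append(cross_val)

    print(name)
    print(score)
    print(cross_val)
    # Blank line between classifiers, but not after the last one
    if model is not classifiers[-1][1]:
        print('')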
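
`cross_val_score` works by splitting the data it receives into folds, fitting a fresh copy of the model on all folds but one, scoring on the fold left out, and repeating so that every fold is used for scoring once. The sketch below shows only the splitting step, in a simplified, unshuffled and unstratified form; the helper `k_fold_indices` is hypothetical and is not how scikit-learn implements it.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for plain, unshuffled k-fold."""
    # Spread any remainder over the first folds so sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples)
                     if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 5))
for train_idx, test_idx in folds:
    print(test_idx)
```

Every sample lands in exactly one scoring fold, so the mean of the per-fold scores uses all of the supplied data while never scoring a model on rows it was fitted on.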

Finally, we use bar charts to compare the classifiers on model score and cross-validation score. We can see that the best predictor for this data is the Linear SVC, followed by the SGD Classifier and then the Random Forest.

In [None]:
names = ['SGD', 'Random Forest', 'Linear SVC']
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(13, 9))

for ax in [[ax1, scores, 'model score'], [ax2, cross_vals, 'cross validation score']]:
    metric = ax[1]
    bars = ax[0].bar(names, metric, color='blue')
    for bar, value in zip(bars, metric):
        label = str(value)[:4]
        height = bar.get_height()
        ax[0].text(bar.get_x() + bar.get_width()/2, height, label, ha='center', va='bottom')
    ax[0].set_title(ax[2])
    ax[0].set_xlabel('model')
    ax[0].set_ylabel('accuracy')

plt.show()

### Thank you for reading my notebook.
### If you enjoyed this notebook and found it helpful, please upvote it and give feedback as it will help me make more of these.