# ------------------> Online News Popularity <-----------------

<img src='http://globalorphanprevention.org/uploads/3/4/3/1/34317250/1434670_orig.jpg' width='1200px'/>

## Contributions 
- **[Abhisar](https://www.kaggle.com/abhisarnarkhede)** : Data collection and cleaning
- **[Deepak](https://www.kaggle.com/deepakshende) and [Amir](https://www.kaggle.com/aahaan007)** : Data exploration, analysis, reporting and dashboarding
- **[Vishnu](https://www.kaggle.com/psvishnu)**: Data modelling and deployment
- **[Sneharth](https://www.kaggle.com/sneharth03) and [Sharmistha](https://www.kaggle.com/sharmistha96)** : Assistance with all the above steps and, most importantly, business requirement definition and data validation

## Business understanding
* This dataset summarizes a heterogeneous set of features about articles published by Mashable over a period of two years. The goal is to predict the number of shares on social networks, which serves as the measure of an article's popularity. The dataset is publicly available from the University of California Irvine (UCI) Machine Learning Repository.

## Attribute Description :

#### Input variables : 
* url: URL of the article (non-predictive)
* timedelta: Days between the article publication and the dataset acquisition (non-predictive)
* n_tokens_title: Number of words in the title
* n_tokens_content: Number of words in the content
* n_unique_tokens: Rate of unique words in the content
* n_non_stop_words: Rate of non-stop words in the content
* n_non_stop_unique_tokens: Rate of unique non-stop words in the content
* num_hrefs: Number of links
* num_self_hrefs: Number of links to other articles published by Mashable
* num_imgs: Number of images
* num_videos: Number of videos
* average_token_length: Average length of the words in the content
* num_keywords: Number of keywords in the metadata
* data_channel_is_lifestyle: Is data channel 'Lifestyle'?
* data_channel_is_entertainment: Is data channel 'Entertainment'?
* data_channel_is_bus: Is data channel 'Business'?
* data_channel_is_socmed: Is data channel 'Social Media'?
* data_channel_is_tech: Is data channel 'Tech'?
* data_channel_is_world: Is data channel 'World'?
* kw_min_min: Worst keyword (min. shares)
* kw_max_min: Worst keyword (max. shares)
* kw_avg_min: Worst keyword (avg. shares)
* kw_min_max: Best keyword (min. shares)
* kw_max_max: Best keyword (max. shares)
* kw_avg_max: Best keyword (avg. shares)
* kw_min_avg: Avg. keyword (min. shares)
* kw_max_avg: Avg. keyword (max. shares)
* kw_avg_avg: Avg. keyword (avg. shares)
* self_reference_min_shares: Min. shares of referenced articles in Mashable
* self_reference_max_shares: Max. shares of referenced articles in Mashable
* self_reference_avg_sharess: Avg. shares of referenced articles in Mashable
* weekday_is_monday: Was the article published on a Monday?
* weekday_is_tuesday: Was the article published on a Tuesday?
* weekday_is_wednesday: Was the article published on a Wednesday?
* weekday_is_thursday: Was the article published on a Thursday?
* weekday_is_friday: Was the article published on a Friday?
* weekday_is_saturday: Was the article published on a Saturday?
* weekday_is_sunday: Was the article published on a Sunday?
* is_weekend: Was the article published on the weekend?
* LDA_00: Closeness to LDA topic 0
* LDA_01: Closeness to LDA topic 1
* LDA_02: Closeness to LDA topic 2
* LDA_03: Closeness to LDA topic 3
* LDA_04: Closeness to LDA topic 4
* global_subjectivity: Text subjectivity
* global_sentiment_polarity: Text sentiment polarity
* global_rate_positive_words: Rate of positive words in the content
* global_rate_negative_words: Rate of negative words in the content
* rate_positive_words: Rate of positive words among non-neutral tokens
* rate_negative_words: Rate of negative words among non-neutral tokens
* avg_positive_polarity: Avg. polarity of positive words
* min_positive_polarity: Min. polarity of positive words
* max_positive_polarity: Max. polarity of positive words
* avg_negative_polarity: Avg. polarity of negative words
* min_negative_polarity: Min. polarity of negative words
* max_negative_polarity: Max. polarity of negative words
* title_subjectivity: Title subjectivity
* title_sentiment_polarity: Title polarity
* abs_title_subjectivity: Absolute subjectivity level
* abs_title_sentiment_polarity: Absolute polarity level
* shares: Number of shares (target)
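
The list above marks `url` and `timedelta` as non-predictive and `shares` as the target. A minimal sketch of how that split could be expressed in code follows; the toy values, the `NON_PREDICTIVE` and `TARGET` constants, and the `split_features_target` helper are hypothetical names introduced only for illustration.

```python
import pandas as pd

# Tiny made-up frame that mimics a few of the columns listed above.
toy = pd.DataFrame({
    "url": ["http://example.com/a", "http://example.com/b"],
    "timedelta": [731.0, 730.0],
    "n_tokens_title": [12.0, 9.0],
    "num_imgs": [1.0, 3.0],
    "shares": [593, 1200],
})

NON_PREDICTIVE = ["url", "timedelta"]  # flagged as non-predictive in the attribute list
TARGET = "shares"


def split_features_target(df, non_predictive=NON_PREDICTIVE, target=TARGET):
    """Return (X, y): the predictive inputs and the target column."""
    X = df.drop(columns=non_predictive + [target])
    y = df[target]
    return X, y


X, y = split_features_target(toy)
print(list(X.columns))
```

Keeping the non-predictive names in one place makes it harder to leak an identifier such as `url` into a model by accident.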

## Dependencies

In [None]:
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler
import os # accessing directory structure
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('ggplot')

## Loading and Describing data

In [None]:
nRowsRead = None
df_newspopularity = pd.read_csv('/kaggle/input/OnlineNewsPopularity.csv', delimiter=',', nrows = nRowsRead)
df_newspopularity.dataframeName = 'OnlineNewsPopularity.csv'
nRow, nCol = df_newspopularity.shape
print(f'There are {nRow} rows and {nCol} columns')
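
One thing worth checking right after loading is whether the header names carry stray whitespace. The sketch below assumes (it is an assumption, not something shown in this notebook) that the CSV header separates names with a comma followed by a space, so every column except the first starts with a space. The two-row CSV text is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical excerpt; the spaces after the commas in the header are the assumption.
csv_text = (
    "url, timedelta, n_tokens_title, shares\n"
    "http://example.com/a, 731.0, 12.0, 593\n"
    "http://example.com/b, 730.0, 9.0, 1200\n"
)

raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))      # names keep their leading space

# Strip surrounding whitespace from every column name.
cleaned = raw.rename(columns=str.strip)
print(list(cleaned.columns))
```

Passing `skipinitialspace=True` to `pd.read_csv` is an alternative that handles the same situation at load time.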

### Let's take a quick look at what the data looks like (first 250 rows) :

In [None]:
df_newspopularity.head(250)

### Describing each column :

In [None]:
df_newspopularity.describe(include='all')

### Distribution graphs (histogram/bar graph) of each column :

In [None]:
# Distribution graphs (histogram/bar graph) of column data
def plotPerColumnDistribution(df, nGraphShown, nGraphPerRow):
    nunique = df.nunique()
    df = df[[col for col in df if nunique[col] > 1 and nunique[col] < 50]] # For displaying purposes, pick columns that have between 1 and 50 unique values
    nRow, nCol = df.shape
    columnNames = list(df)
    nGraphRow = (nCol + nGraphPerRow - 1) // nGraphPerRow # ceiling division; plt.subplot needs an integer row count
    plt.figure(num = None, figsize = (6 * nGraphPerRow, 8 * nGraphRow), dpi = 80, facecolor = 'w', edgecolor = 'k')
    for i in range(min(nCol, nGraphShown)):
        plt.subplot(nGraphRow, nGraphPerRow, i + 1)
        columnDf = df.iloc[:, i]
        if (not np.issubdtype(type(columnDf.iloc[0]), np.number)):
            valueCounts = columnDf.value_counts()
            valueCounts.plot.bar()
        else:
            columnDf.hist()
        plt.ylabel('counts')
        plt.xticks(rotation = 90)
        plt.title(f'{columnNames[i]} (column {i})')
    plt.tight_layout(pad = 1.0, w_pad = 1.0, h_pad = 1.0)
    plt.show()

In [None]:
plotPerColumnDistribution(df_newspopularity, 61, 5)
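
`plotPerColumnDistribution` only plots columns with more than 1 and fewer than 50 unique values, so continuous columns such as `shares` are likely to be skipped. The sketch below reproduces just that filter on a made-up frame to show which columns survive; the toy data is hypothetical.

```python
import pandas as pd

n = 60
toy = pd.DataFrame({
    "weekday_is_monday": [i % 2 for i in range(n)],  # 2 unique values: kept
    "constant": [1] * n,                             # 1 unique value: skipped
    "num_keywords": [i % 10 for i in range(n)],      # 10 unique values: kept
    "shares": list(range(n)),                        # 60 unique values: skipped
})

nunique = toy.nunique()
# Same condition as in plotPerColumnDistribution.
selected = [col for col in toy if 1 < nunique[col] < 50]
print(selected)
```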

### Correlation matrix :

In [None]:
df_newspopularity.select_dtypes(include=[np.number]).corr() # correlate numeric columns only; 'url' is a string column
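
Since the goal is to predict `shares`, it can help to read a single column of the correlation matrix rather than the whole grid. A minimal sketch on made-up data, ranking features by the absolute value of their correlation with the target; the toy frame is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "url": ["a", "b", "c", "d", "e"],       # non-numeric, excluded below
    "num_imgs": [1, 2, 3, 4, 5],
    "num_videos": [5, 3, 4, 1, 2],
    "shares": [100, 200, 300, 400, 500],
})

numeric = toy.select_dtypes(include="number")
# Correlation of every numeric feature with the target, target itself removed.
target_corr = numeric.corr()["shares"].drop("shares")
# Order by strength of the linear relationship, regardless of sign.
ranked = target_corr.sort_values(key=lambda s: s.abs(), ascending=False)
print(ranked)
```

Pearson correlation only captures linear relationships, so a low value here does not rule out a feature being useful.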

In [None]:
# Correlation matrix
def plotCorrelationMatrix(df, graphWidth):
    filename = df.dataframeName
    df = df.select_dtypes(include=[np.number]).dropna(axis='columns') # keep numeric columns and drop those with NaN
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    if df.shape[1] < 2:
        print(f'No correlation plots shown: The number of non-NaN or constant columns ({df.shape[1]}) is less than 2')
        return
    corr = df.corr()
    plt.figure(num=None, figsize=(graphWidth, graphWidth), dpi=80, facecolor='w', edgecolor='k')
    corrMat = plt.matshow(corr, fignum = 1)
    plt.xticks(range(len(corr.columns)), corr.columns, rotation=90)
    plt.yticks(range(len(corr.columns)), corr.columns)
    plt.gca().xaxis.tick_bottom()
    plt.colorbar(corrMat)
    plt.title(f'Correlation Matrix for {filename}', fontsize=15)
    plt.show()

In [None]:
plotCorrelationMatrix(df_newspopularity, 10)
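
With around sixty columns the matrix plot is hard to read by eye. One way to complement it, sketched below on made-up data, is to list the most strongly correlated column pairs by taking the strict upper triangle of the correlation matrix. The `top_correlated_pairs` helper is a hypothetical name, not part of the notebook.

```python
import numpy as np
import pandas as pd


def top_correlated_pairs(df, n=5):
    """Return the n column pairs with the largest absolute correlation."""
    corr = df.select_dtypes(include="number").corr()
    # Boolean mask for the strict upper triangle (k=1 skips the diagonal),
    # so each pair appears once and self-correlations are ignored.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    order = pairs.abs().sort_values(ascending=False).index
    return pairs.reindex(order).head(n)


toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],   # exact multiple of "a"
    "c": [3, 1, 4, 1, 5],
})
top = top_correlated_pairs(toy)
print(top)
```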

### Scatter and Density plots:

In [None]:
# Scatter and density plots
def plotScatterMatrix(df, plotSize, textSize):
    df = df.select_dtypes(include =[np.number]) # keep only numerical columns
    # Remove rows and columns that would lead to df being singular
    df = df.dropna(axis='columns') # drop columns with NaN
    df = df[[col for col in df if df[col].nunique() > 1]] # keep columns where there are more than 1 unique values
    columnNames = list(df)
    if len(columnNames) > 10: # reduce the number of columns for matrix inversion of kernel density plots
        columnNames = columnNames[:10]
    df = df[columnNames]
    ax = pd.plotting.scatter_matrix(df, alpha=0.75, figsize=[plotSize, plotSize], diagonal='kde')
    corrs = df.corr().values
    for i, j in zip(*np.triu_indices_from(ax, k = 1)):
        ax[i, j].annotate('Corr. coef = %.3f' % corrs[i, j], (0.8, 0.2), xycoords='axes fraction', ha='center', va='center', size=textSize)
    plt.suptitle('Scatter and Density Plot')
    plt.show()

In [None]:
plotScatterMatrix(df_newspopularity, 20, 10)
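
`plotScatterMatrix` does not plot every column: it keeps numeric, NaN-free, non-constant columns and then only the first ten of them. The sketch below isolates that selection step so it can be inspected without drawing anything; `scatter_matrix_columns` and the toy frame are hypothetical.

```python
import numpy as np
import pandas as pd


def scatter_matrix_columns(df, max_cols=10):
    """Mirror the column selection performed inside plotScatterMatrix."""
    numeric = df.select_dtypes(include=[np.number])   # numeric columns only
    numeric = numeric.dropna(axis="columns")          # drop columns containing NaN
    keep = [col for col in numeric if numeric[col].nunique() > 1]  # drop constants
    return keep[:max_cols]


data = {
    "url": ["a", "b", "c"],           # non-numeric: dropped
    "const": [1, 1, 1],               # constant: dropped
    "with_nan": [1.0, np.nan, 2.0],   # contains NaN: dropped
}
# Twelve varying numeric columns f0..f11; only the first ten are kept.
for i in range(12):
    data[f"f{i}"] = [i, i + 1, i + 2]

cols = scatter_matrix_columns(pd.DataFrame(data))
print(cols)
```

Because the cut is positional, which ten columns appear depends on column order in the file rather than on their relevance to `shares`.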