# Exploratory Analysis

We use Jupyter Notebook for this analysis because it is easier and faster to iterate in than standard Python scripts, particularly when we need to draw plots.

Besides plots and insights about some distributions, a small amount of feature extraction is performed:
- the number of tokens in the text fields
- information from the URLs

This task is also done via Python scripts, which are a much better way to keep track of the experiments and to speed up the eventual productionization.

## Load Data

In [None]:
# Importing required libraries
import os
import pandas as pd
import numpy as np
import seaborn as sns #visualisation
import matplotlib.pyplot as plt #visualisation
%matplotlib inline 
sns.set(color_codes=True)

In [None]:
# Find the path of the input data file (see readme.txt in the "/data" folder for how to get this file)
notebooks_folder_path = os.path.abspath('') # /notebooks
data_folder_path = os.path.join(notebooks_folder_path, '..', 'data') # /data

path_data = os.path.join(data_folder_path, 'data_redacted.tsv')

In [None]:
# load it into df
df = pd.read_csv(path_data, sep='\t')

In [None]:
df.shape

In [None]:
df.head()

In [None]:
duplicate_rows_df = df[df.duplicated()]
duplicate_rows_df.shape

# there are no duplicates

In [None]:
# Finding the null values
print(df.isnull().sum())

## Explore columns

In [None]:
# The task is to classify the "category" starting from the columns "title", "text" and "url"

### Category

In [None]:
# let's start by looking at the "category" distribution
df['category'].value_counts().plot(kind='bar', figsize=(10,5))
plt.title("Category Distribution")

# The distribution is not uniform: it is unbalanced, although not dramatically so.
# I still need to account for this when choosing the metrics...
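One way to account for the imbalance in the metrics is to report a macro-averaged F1 next to plain accuracy, since macro averaging weights every category equally. The sketch below is a minimal pure-Python illustration on made-up labels; the helper names (`per_class_f1`, `macro_f1`, `accuracy`) are hypothetical and the choice of macro F1 is an assumption, not necessarily the metric used later in the project.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[t] += 1  # true label t was missed
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy, made-up labels: 8 "sports" rows and 2 "cars_motors" rows.
y_true = ["sports"] * 8 + ["cars_motors"] * 2
y_pred = ["sports"] * 10  # a model that always predicts the majority class

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.2f}  macro_f1={mf1:.2f}")
```

On this toy input the always-majority predictor still gets a high accuracy, while the macro F1 drops because the minority category is never predicted.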

### Title and Text

In [None]:
# "title" and "text" are quite straightforward fields for news classification

# However, in my experience the length of the text can be quite informative
# let's look at the distribution of the number of words in those fields

# split on whitespace and count the resulting tokens (.str.len() alone would count characters)
df['title_len'] = df['title'].str.split().str.len()
df['text_len'] = df['text'].str.split().str.len()

In [None]:
# some basic statistics about title_len
df['title_len'].describe()

In [None]:
# some basic statistics about text_len
df['text_len'].describe()

In [None]:
sns.boxplot(x=df['title_len'])

# the title is usually a short string
# max number of tokens < 512, so I don't see any limitation for the kind of model I want to use next for classification

In [None]:
sns.boxplot(x=df['text_len'])

# the text is much longer than the title...
# Using the whole text can be a problem, since a lot of computing power is needed, resulting in
#       higher costs
#       slower inference
#       and possibly only a marginal improvement in accuracy

# In my experience, it is better to truncate the text (I take the first N tokens),
# but I could also keep the last N tokens, or some other selection...

# here is a thread about classifying long texts with transformers
# https://stackoverflow.com/questions/58636587/how-to-use-bert-for-long-text-classification?rq=1

# Besides the use of Reformer or Longformer, truncation is recommended anyway
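The truncation options discussed above (first N tokens, last N tokens, or a mix of both) can be sketched as follows. This is only an illustration: `truncate_tokens` and its `strategy` names are hypothetical, and it splits on whitespace, which is an assumption. A transformer tokenizer produces subword tokens, so the real token count will usually be higher than the word count.

```python
def truncate_tokens(text, max_tokens, strategy="head"):
    """Keep at most `max_tokens` whitespace-separated tokens of `text`.

    strategy:
      "head"      -> keep the first max_tokens tokens
      "tail"      -> keep the last max_tokens tokens
      "head_tail" -> keep the first half and the last half of the budget
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return " ".join(tokens)  # already short enough
    if strategy == "head":
        kept = tokens[:max_tokens]
    elif strategy == "tail":
        kept = tokens[-max_tokens:]
    elif strategy == "head_tail":
        n_head = max_tokens // 2
        n_tail = max_tokens - n_head  # always >= 1, so the slice below is safe
        kept = tokens[:n_head] + tokens[-n_tail:]
    else:
        raise ValueError(f"unknown strategy: {strategy!r}")
    return " ".join(kept)

sample = "a b c d e f g h"
head = truncate_tokens(sample, 3, "head")
tail = truncate_tokens(sample, 3, "tail")
both = truncate_tokens(sample, 4, "head_tail")
print(head, "|", tail, "|", both)
```

Applied to the dataframe, something like `df['text'].apply(truncate_tokens, max_tokens=N)` would produce the truncated column, with N left as a tuning choice.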

In [None]:
# do the categories have different word-count distributions?

In [None]:
ax = sns.boxplot(x="category", y="title_len", data=df)
plt.xticks(rotation=90)

In [None]:
ax = sns.boxplot(x="category", y="text_len", data=df)
plt.xticks(rotation=90)

In [None]:
sns.jointplot(
    data=df,
    x="text_len", y="title_len", hue="category",
    kind="kde"
)

In [None]:
# the word-count distributions can be very different across categories

In [None]:
sns.jointplot(
    data=df[df['category'].isin(['sports', 'fashion_beauty_lifestyle', 'technology_science', 'cars_motors'])],
    x="text_len", y="title_len", hue="category",
    kind="kde"
)

In [None]:
df.sample(50)['text']

# English seems to be the only language

### Url

In [None]:
# "url" is less readable, but it can hide precious information...
# I want to see whether I can get some useful insights to exploit in the ML model

In [None]:
# print some urls
df['url'].tolist()[:10]

In [None]:
# the url can contain very useful information!
#     Some domains are repeated and frequent
#     From some sources, I expect a certain bias
#           example: from www.sciencedaily.com I can expect many news items about "technology_science"
#     In the path of the url there are some keywords I can use
#           take a look at the example below, where I see "beauty" in the url...

example_url = df['url'].iloc[4]
example_category = df['category'].iloc[4]

print(example_url)
print(example_category)

In [None]:
# I will try to parse these urls
# https://docs.python.org/3/library/urllib.parse.html
from urllib.parse import urlparse

In [None]:
# a few arbitrarily chosen examples

In [None]:
example_url = df['url'].iloc[100]
urlparse(example_url)

In [None]:
example_url = df['url'].iloc[105]
urlparse(example_url)

In [None]:
example_url = df['url'].iloc[21]
urlparse(example_url)

In [None]:
# "netloc" and "path" are probably the most useful info I could need

In [None]:
df['urlparse'] = df['url'].apply(urlparse)
df['netloc'] = df['urlparse'].apply(lambda x: x.netloc)
df['netloc']

In [None]:
df['netloc'].value_counts()

# some netlocs are repeated many times

In [None]:
netloc_count = df['netloc'].value_counts()
different_netlocs = len(netloc_count)
different_netlocs

In [None]:
# count distribution
netloc_count.plot()

In [None]:
# a few domains are very frequent, while many of them appear just 1, 2 or 3 times

In [None]:
# focus
netloc_count[:40].plot()
plt.xticks(rotation=90)

In [None]:
netloc_count[:20]

In [None]:
# I will keep the "netloc" value for the most frequent domains
# I'll map all the rest to a fictitious label called "Other"

In [None]:
df['netloc_mod'] = df['netloc'].mask(df['netloc'].map(df['netloc'].value_counts()) < 50, 'Other')
df['netloc_mod']
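For the Python scripts mentioned at the top, the same bucketing of rare values can be written without pandas. The helper below is a minimal sketch: `bucket_rare` is a hypothetical name, and the toy list of domains and the threshold of 2 are made up for illustration (the cell above uses 50).

```python
from collections import Counter

def bucket_rare(values, min_count, other_label="Other"):
    """Replace every value that occurs fewer than `min_count` times with `other_label`."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Toy, made-up list of netlocs.
netlocs = ["phys.org", "phys.org", "phys.org", "a.example", "b.example", "phys.org"]
bucketed = bucket_rare(netlocs, min_count=2)
print(bucketed)
```

Note that the counts here come from the same list that is being transformed; when the mapping is reused on new data, the set of frequent values would need to be stored and applied rather than recomputed.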

In [None]:
# this is the new distribution of the netloc
new_count_distr = df['netloc_mod'].value_counts()
new_count_distr

In [None]:
sns.set(rc={'figure.figsize':(20.7,8.27)})
ax = sns.countplot(x="netloc_mod", hue="category", data=df)
plt.xticks(rotation=90)

In [None]:
# from this plot I can see that there are dependencies between the category and the netloc, as intuition suggests
#      from phys.org I see only science-related news (the same holds for wired.co.uk)
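The dependency between source and category read off the plot can also be put into numbers by computing, for each netloc, the share of its rows that falls into each category. The sketch below uses plain Python on made-up `(netloc, category)` pairs; `category_share_by_source` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter, defaultdict

def category_share_by_source(rows):
    """Map each netloc to {category: fraction of that netloc's rows}."""
    counts = defaultdict(Counter)
    for netloc, category in rows:
        counts[netloc][category] += 1
    shares = {}
    for netloc, cat_counts in counts.items():
        total = sum(cat_counts.values())
        shares[netloc] = {cat: n / total for cat, n in cat_counts.items()}
    return shares

# Toy, made-up rows for illustration only.
rows = [
    ("phys.org", "technology_science"),
    ("phys.org", "technology_science"),
    ("phys.org", "technology_science"),
    ("Other", "sports"),
    ("Other", "technology_science"),
]
shares = category_share_by_source(rows)
for netloc, dist in shares.items():
    print(netloc, dist)
```

With the dataframe at hand, `zip(df['netloc_mod'], df['category'])` could serve as the `rows` input. A source whose largest share is close to 1.0 is a strong signal for a single category.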

In [None]:
# after the netloc, I can extract the path
df['path'] = df['urlparse'].apply(lambda x: x.path)

In [None]:
df['path'][:10]
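Since the path often carries keywords (such as "beauty" in the earlier example), one possible next step is to split it into lowercase word tokens. The sketch below is an assumption about how that could be done: `url_features` is a hypothetical helper, the example URL is made up, and dropping purely numeric pieces (dates, ids) is a design choice rather than something the data dictates.

```python
import re
from urllib.parse import urlparse

def url_features(url):
    """Return the lowercase netloc and the keyword tokens found in the path."""
    parsed = urlparse(url)
    # Split the path on anything that is not a letter or digit,
    # then drop empty pieces and purely numeric ones (dates, ids).
    pieces = re.split(r"[^a-z0-9]+", parsed.path.lower())
    tokens = [p for p in pieces if p and not p.isdigit()]
    return {"netloc": parsed.netloc.lower(), "path_tokens": tokens}

# Made-up URL, for illustration only.
example = url_features("https://www.example.com/beauty/2020/05/skin-care-tips.html")
print(example)
```

File extensions such as `html` survive this split; a small stop list could remove them if they turn out to be noise.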

In [None]:
# from the urls I can also extract the query and the params, but they don't seem useful, as you can see below...

In [None]:
df['query'] = df['urlparse'].apply(lambda x: x.query)
df['query'].value_counts()[:10]

In [None]:
df['params'] = df['urlparse'].apply(lambda x: x.params)
df['params'].value_counts()[:10]