# Exploration
This notebook will explore our dataset to inform our cleaning activities.

## Dictionary

- ID : the numeric ID of the article
- TITLE : the headline of the article
- URL : the URL of the article
- PUBLISHER : the publisher of the article
- CATEGORY : the category of the news item; one of:
  - e : entertainment
  - b : business
  - t : science and technology
  - m : health
- STORY : alphanumeric ID of the news story that the article discusses
- HOSTNAME : hostname where the article was posted
- TIMESTAMP : approximate timestamp of the article's publication, given in Unix time (seconds since midnight UTC on Jan 1, 1970)
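
The Unix time format described for TIMESTAMP can be turned into a readable date with the standard library. The following is a minimal sketch, assuming the values are in seconds as the dictionary states; the helper name `unix_to_utc` and the sample values are hypothetical and not taken from the dataset. If the raw values turn out to be 13 digits long, they are probably milliseconds, which the `unit="ms"` branch handles.

```python
from datetime import datetime, timezone

def unix_to_utc(ts, unit="s"):
    """Convert a Unix timestamp to a timezone-aware UTC datetime (hypothetical helper)."""
    if unit == "ms":
        ts = ts / 1000  # milliseconds -> seconds
    elif unit != "s":
        raise ValueError("unit must be 's' or 'ms'")
    return datetime.fromtimestamp(ts, tz=timezone.utc)

# Made-up sample values, not taken from the dataset
print(unix_to_utc(0))                      # the Unix epoch
print(unix_to_utc(86_400))                 # one day after the epoch
print(unix_to_utc(86_400_000, unit="ms"))  # the same instant, given in milliseconds
```

For a whole column, `pd.to_datetime(data['TIMESTAMP'], unit='s')` should give the equivalent result once the data is loaded.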

In [None]:
# Import packages
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import re

from sklearn.feature_extraction.text import CountVectorizer

from src.model_operations import count_tokens

In [None]:
# Import data
data = pd.read_csv("data/downsampled_dataset.csv")

# Global Dataset Statistics

In [None]:
# What does the data look like?
data.head()

In [None]:
# How many observations?
len(data)

In [None]:
# What is a typical article title?
data['TITLE'][100]

# Target Column Exploration

In [None]:
# Any missing values?
data['CATEGORY'].isnull().sum()

In [None]:
# Distribution of values
data['CATEGORY'].value_counts()

# Text Column Exploration

In [None]:
# Any missing values?
data['TITLE'].isnull().sum()

## Length of TITLE (number of words)

Title length has a large impact on our choice of Large Language Model (LLM).

LLMs operate on a fixed maximum sequence length, measured in tokens rather than words.

This means any input with fewer than N tokens is padded to reach the maximum length, and any longer input is truncated.

This maximum length is set when the LLM is trained.

It is also one of the factors that make some LLMs better suited to a task than others.
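
The effect of a fixed sequence length can be illustrated with a small sketch. It assumes whitespace splitting as a stand-in for a real tokenizer (actual LLM tokenizers usually work on subword units) and uses `"[PAD]"` as a placeholder padding token; the function name `pad_or_truncate` and the sample titles are hypothetical.

```python
def pad_or_truncate(tokens, max_length, pad_token="[PAD]"):
    """Return exactly max_length tokens: truncate long inputs, pad short ones."""
    clipped = tokens[:max_length]
    return clipped + [pad_token] * (max_length - len(clipped))

# Made-up titles, not taken from the dataset
titles = ["Stocks rally", "New study links sleep to better heart health"]
for title in titles:
    tokens = title.split()  # whitespace split is a stand-in for a real tokenizer
    print(pad_or_truncate(tokens, max_length=5))
```

The short title gains three padding tokens, while the long one loses everything after its fifth word, which is why the distribution of title lengths matters when picking a model.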

In [None]:
# Check average length of TITLE values
data['title length'] = data['TITLE'].apply(count_tokens)


# Print out metrics
print(f"Minimum number of words in TITLE: {np.round(data['title length'].min())}")
print(f"Average number of words in TITLE: {np.round(data['title length'].mean())}")
print(f"Maximum number of words in TITLE: {np.round(data['title length'].max())}")

In [None]:
# Show distribution of title length
# (sns.distplot is deprecated; histplot with kde=True is its replacement)
sns.histplot(data['title length'], kde=True)

In [None]:
# Investigate long titles (more than 10 words)
high_values = data[data['title length'] > 10]

print(f"Number of TITLE obs with more than 10 words: {len(high_values)}")
# Check a single value
high_values['TITLE'].iloc[0]

In [None]:
# Investigate short titles (3 words or fewer)
low_values = data[data['title length'] <= 3]

# Number of short titles
print(f"Number of TITLE obs with 3 or fewer words: {len(low_values)}")
# Check a single value
low_values['TITLE'].iloc[0]

## Check Special Characters

In [None]:
# How many observations have special characters?

def check_for_special_chars(text):
    """ Returns True if a string contains any character other than letters, digits, or whitespace. """
    # str.isalnum() is False for any title containing a space, so use a regex instead
    return bool(re.search(r"[^A-Za-z0-9\s]", text))

# Check to see if special characters are present
data['special chars present'] = data['TITLE'].apply(check_for_special_chars)
# Check number of observations with special characters
data['special chars present'].sum()

In [None]:
# Check how prevalent special chars are in the text
def remove_special_chars(text):
    """ Removes special chars from a string. """
    return re.sub(r"[$&+;=@#|<>^*%-]", "", text)

# Remove special characters
data['special chars removed'] = data['TITLE'].apply(remove_special_chars)

## Check Word Frequency

In [None]:
# Most frequent 20 words

# Create document-term matrix (DTM)
cv = CountVectorizer(ngram_range=(1, 1))
dtm = cv.fit_transform(data['TITLE'])
words = np.array(cv.get_feature_names_out())

# Look at the top 20 most frequent words
freqs = np.asarray(dtm.sum(axis=0)).flatten()
index = np.argsort(freqs)[-20:]
print(list(zip(words[index], freqs[index])))

WordFreq = pd.DataFrame.from_records(list(zip(words[index], freqs[index])))
WordFreq.columns = ['Word', 'Freq']

# Use a separate name so the `data` DataFrame is not overwritten
word_freq_dict = dict(zip(WordFreq['Word'].tolist(), WordFreq['Freq'].tolist()))

In [None]:
# Plot horizontal bar graph
fig, ax = plt.subplots(figsize=(8, 8))
WordFreq.sort_values(by='Freq').plot.barh(
                      x='Word',
                      y='Freq',
                      ax=ax,
                      color="deepskyblue")

plt.title("Count of Most Common Words")