# Final Report: BBC Document Analysis

By: Maclean Sherren

<br>


## Introduction

In my research into language models and their respective pros and cons, I came across the Google News dataset. It ships with pre-trained vectors: 300-dimensional embeddings for 3 million words and phrases, trained on roughly 100 billion words of news text from large media outlets.

This report aims to explore the effectiveness of Google's pretrained vectorization model compared to regular data vectorization. First, I will explore the data, highlighting any anomalies or noteworthy features. Next, the data will be run through a regular word vectorization model called Word2Vec, which, like TF-IDF, turns text into numeric features. The transformed data will then be split and used to train two machine learning models: a linear model and a random forest. This process will be repeated with the Google News vectorization model in place of the Word2Vec model. Finally, the results from each vectorization model will be compared to better understand the strengths and weaknesses of each.
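
As a rough illustration of the comparison step described above, the sketch below scores a logistic regression and a random forest on the same feature matrix with cross-validation. The helper name `compare_classifiers` and the synthetic data from `make_classification` are assumptions made for this example only; they stand in for the document vectors and are not the report's actual data or results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def compare_classifiers(features, labels, cv=5):
    """Return the mean cross-validated accuracy for each classifier."""
    classifiers = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
    }
    return {
        name: cross_val_score(clf, features, labels, cv=cv, scoring="accuracy").mean()
        for name, clf in classifiers.items()
    }


# Synthetic stand-in for document vectors: 200 samples, 20 features, 5 classes.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=10, n_classes=5, random_state=42
)
scores = compare_classifiers(X_demo, y_demo)
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Running the same helper once per vectorization method keeps the classifiers and the evaluation fixed, so any difference in accuracy can be attributed to the features.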

<br>


## Exploratory Data Analysis and Data Engineering

In [1]:
# Libraries
# !! Packages like gensim and sklearn may need to be installed using conda or pip to successfully run this cell

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
import gensim
from gensim import corpora, models, similarities, matutils
from gensim import utils
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
import gensim.downloader as api
from gensim.models import KeyedVectors

In [None]:
# Data
bbc_data = pd.read_csv('bbc_data.csv')

bbc_data.head()

# Number of documents in each label
print(bbc_data['labels'].value_counts())

# Average number of words per document for each label, plotted as a bar chart

bbc_data_copy = bbc_data.copy()
bbc_data_copy['doc_length'] = bbc_data_copy['data'].apply(lambda x: len(x.split()))

bbc_data_copy.groupby('labels')['doc_length'].mean().plot(kind='bar')
plt.ylabel('Average number of words')
plt.title('Average number of words in each document per label')
plt.show()

In [None]:
# Boxplot of the number of words in each document per label

sns.boxplot(x='doc_length', y='labels', data=bbc_data_copy)
# doc_length is on the x-axis, so the word-count label belongs there
plt.xlabel('Number of words')
plt.ylabel('Label')
plt.title('Number of words in each document per label')
plt.show()

<br>
Interestingly, the two labels with the shortest documents by word count are sports and entertainment, with business close behind, while politics and tech documents run much longer. Every label also contains outliers in document length.
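
To put numbers behind an observation like this, a per-label summary table can complement the plots. The sketch below is a minimal example on a hypothetical toy frame that reuses the `labels` and `doc_length` column names; the values are invented for illustration and are not taken from the BBC data. Outliers are counted with the 1.5 × IQR rule, which is the default whisker rule in seaborn's boxplot.

```python
import pandas as pd

# Hypothetical toy frame with the same column names as bbc_data_copy.
toy = pd.DataFrame({
    "labels": ["sport"] * 5 + ["tech"] * 5,
    "doc_length": [200, 220, 210, 230, 900, 480, 500, 520, 510, 490],
})


def count_outliers(lengths):
    """Count values lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = lengths.quantile(0.25), lengths.quantile(0.75)
    iqr = q3 - q1
    outside = (lengths < q1 - 1.5 * iqr) | (lengths > q3 + 1.5 * iqr)
    return int(outside.sum())


# One row per label: central tendency, maximum, and number of outliers.
summary = toy.groupby("labels")["doc_length"].agg(["median", "mean", "max"])
summary["n_outliers"] = toy.groupby("labels")["doc_length"].apply(count_outliers)
print(summary)
```

Swapping `toy` for `bbc_data_copy` would give the same table for the real documents.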

<br>

In [None]:
# Preprocess (lowercase and tokenize each document), then make a stratified 80/20 train/test split

X = bbc_data['data'].apply(gensim.utils.simple_preprocess)
y = bbc_data['labels']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
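
After this step each document is a list of tokens, whereas the classifiers expect one fixed-length numeric vector per document. One common way to bridge that gap is to average the word vectors of a document's tokens. The sketch below assumes that approach; the helper `document_vector` and the 3-dimensional `toy_vectors` are hypothetical and exist only for illustration. Any lookup that supports `token in vectors` and `vectors[token]` should work in place of the dictionary, which to my knowledge includes gensim's `KeyedVectors`.

```python
import numpy as np


def document_vector(tokens, vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return np.zeros(dim)
    return np.mean(known, axis=0)


# Hypothetical 3-dimensional embeddings, for illustration only.
toy_vectors = {
    "match": np.array([1.0, 0.0, 0.0]),
    "goal": np.array([0.0, 1.0, 0.0]),
}

doc = ["match", "goal", "unknownword"]
# Tokens missing from the vocabulary are skipped rather than raising an error.
print(document_vector(doc, toy_vectors, dim=3))
```

Applying such a function to every row of `X_train` and `X_test` and stacking the results with `np.vstack` would yield feature matrices of shape (number of documents, `dim`).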