# Test-Train Split

In the following code we take the preprocessed articles in cleaned_strings.csv, created by 02-PreProcessing.ipynb, and separate them into a training set and a test set. The training set contains 90% of the articles from each topic, and the test set contains the remaining 10% of each topic. We preferred this per-topic split to simply taking 90% of all the articles, because a plain random split could leave some topics overrepresented or underrepresented in either set.
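
For comparison, the same per-topic idea can be sketched with pandas' groupby(...).sample, which draws a fraction from every group in a single call. This is an illustrative sketch only, not the code used in this notebook: the toy dataframe, the names toy, toy_test and toy_train, and the text column are assumptions made for the example, and GroupBy.sample requires pandas 1.1 or newer.

```python
import pandas as pd

# Toy stand-in for cleaned_strings.csv (hypothetical data; only 'category' matches the notebook)
toy = pd.DataFrame({
    'category': ['business'] * 10 + ['sport'] * 20,
    'text': ['article %d' % i for i in range(30)],
})

# Draw 10% of the rows from each category for the test set
toy_test = toy.groupby('category').sample(frac=0.1, random_state=1)

# Every row that was not drawn goes to the training set
toy_train = toy.drop(toy_test.index)

print(toy_test['category'].value_counts().to_dict())
print(toy_train['category'].value_counts().to_dict())
```

Because the sampling happens inside each group, every topic keeps the same share in both sets, which is the property the explicit per-topic code below is written to guarantee.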

In [14]:
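# Importing the necessary libraries
import os
import pandas as pd
import random
from numpy.random import RandomState

# Seed the generator that is passed to DataFrame.sample so the split is reproducible;
# random.seed alone only affects Python's random module, not this RandomState
rng = RandomState(1)
random.seed(1)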

# Read the preprocessed articles in and create one dataframe per topic (5 in total)
df = pd.read_csv('cleaned_strings.csv')
print(df.shape)
b = df[df['category'] == 'business']
e = df[df['category'] == 'entertainment']
p = df[df['category'] == 'politics']
s = df[df['category'] == 'sport']
t = df[df['category'] == 'tech']

# Determining the number of articles from each topic to be included in the test set
num = [round(len(topic) * 0.1) for topic in (b, e, p, s, t)]

# Randomly selecting the desired number of articles from each topic to include in the test set
test = [b.sample(n=num[0], random_state=rng),
        e.sample(n=num[1], random_state=rng),
        p.sample(n=num[2], random_state=rng),
        s.sample(n=num[3], random_state=rng),
        t.sample(n=num[4], random_state=rng)]

# Putting the remainder of the articles in the training set
train = [b.loc[~b.index.isin(test[0].index)],
         e.loc[~e.index.isin(test[1].index)],
         p.loc[~p.index.isin(test[2].index)],
         s.loc[~s.index.isin(test[3].index)],
         t.loc[~t.index.isin(test[4].index)]]

# Turning the lists into dataframes
testframe = pd.concat(test)
trainframe = pd.concat(train)

print(testframe.shape)
print(trainframe.shape)

(2225, 3)
(223, 3)
(2002, 3)
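
The printed shapes confirm the overall 223/2002 split, but they do not show that each topic contributes roughly 10% to the test set or that the two sets are disjoint. A small check along the following lines could cover both. It is a sketch under stated assumptions: split_report is a hypothetical helper name, and the toy frames stand in for trainframe and testframe so that the snippet runs on its own.

```python
import pandas as pd

def split_report(train, test, column='category'):
    """Return the test-set share per topic and whether the two sets share any index label."""
    test_counts = test[column].value_counts()
    train_counts = train[column].value_counts()
    # Total number of articles per topic across both sets
    total = test_counts.add(train_counts, fill_value=0)
    share = (test_counts.reindex(total.index, fill_value=0) / total).round(3)
    # pd.concat keeps the original row labels, so shared labels mean shared articles
    overlap = bool(train.index.intersection(test.index).size)
    return share.to_dict(), overlap

# Toy stand-ins for trainframe and testframe (hypothetical data)
toy = pd.DataFrame({'category': ['business'] * 10 + ['tech'] * 20})
toy_test = toy.iloc[[0, 10, 11]]
toy_train = toy.drop(toy_test.index)

shares, overlap = split_report(toy_train, toy_test)
print(shares)
print(overlap)
```

In the notebook the call would be split_report(trainframe, testframe); shares close to 0.1 for every topic and an overlap of False would indicate the split behaved as described.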


In [15]:
# Creating .csv files from the dataframes

#************************************************************************************************
testframe.to_csv('articles_test_90_10.csv', sep=",", index=False)
trainframe.to_csv('articles_train_90_10.csv', sep=",", index=False)
#************************************************************************************************