<span style="font-family:Trebuchet MS; font-size:2em;">Project 3 | NB2: Cleaning and Preprocessing</span>

Riley Robertson | Reddit Classification Project | Market Research: Sports Fans in the U.S. and England

# **Imports and setup**

## Module Imports

I began by importing basic libraries and, as I cleaned, returned to add modules as needed. I also set preferences, assigned variables, imported my data, and set up my main DataFrame so I could begin cleaning.

In [1]:
# basics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

# custom
import utilities.densmore as dns

# date and time 
import datetime as dt
import time

# for CVEC test
from sklearn.feature_extraction.text import CountVectorizer

## Module Preferences

In [2]:
pd.set_option('display.max_colwidth', None)

---

## Data Import and Setup

In [3]:
df_nfl = pd.read_csv('../data/1_raw/raw_nfl_v4.csv', low_memory=False)
df_epl = pd.read_csv('../data/1_raw/raw_epl_v4.csv', low_memory=False)

In [4]:
df_nfl.shape, df_epl.shape

((99661, 13), (99589, 13))

In [5]:
# df_nfl_full = pd.read_csv('../git_ignore/output/raw_nfl_v4_full.csv', low_memory=False)
# df_epl_full = pd.read_csv('../git_ignore/output/raw_epl_v4_full.csv', low_memory=False)

In [6]:
# df_nfl_full.shape, df_epl_full.shape

### Merging the DataFrames

In [7]:
df = pd.concat([df_epl, df_nfl], ignore_index=True)

In [8]:
# df_full = pd.concat([df_epl_full, df_nfl_full], ignore_index=True)

### Checks

In [9]:
# df.shape
# df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199250 entries, 0 to 199249
Data columns (total 13 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   index            199250 non-null  int64 
 1   subreddit        199250 non-null  object
 2   created_utc      199250 non-null  int64 
 3   author           199250 non-null  object
 4   num_comments     199250 non-null  int64 
 5   score            199250 non-null  int64 
 6   is_self          199250 non-null  bool  
 7   link_flair_text  29684 non-null   object
 8   title            199250 non-null  object
 9   selftext         168312 non-null  object
 10  full_link        199250 non-null  object
 11  date             199250 non-null  object
 12  time             199250 non-null  object
dtypes: bool(1), int64(4), object(8)
memory usage: 18.4+ MB


### Detecting Text Encoding

**Using Chardetect terminal command**

In [10]:
# ! chardetect ../data/raw/raw_nfl_v4.csv  

**Using `with open()`**

In [11]:
# with open('../data/raw/raw_nfl_v4.csv') as f:
#     print(f)

### Renaming PremierLeague Subreddit to 'epl'

In [12]:
df['subreddit'].value_counts()

nfl              99661
PremierLeague    99589
Name: subreddit, dtype: int64

In [13]:
df['subreddit'] = df['subreddit'].map(lambda x: 'epl' if x == 'PremierLeague' else x)

In [14]:
df['subreddit'].value_counts()

nfl    99661
epl    99589
Name: subreddit, dtype: int64

### Column Sorting and Filtering

As I cleaned, I realized that I didn't need some of the columns I had initially included in my scraped data, so I dropped them (`is_self` and `full_link`) and reordered the remaining columns for ease of viewing.

In [15]:
df = df[['subreddit', 'created_utc', 'date', 'time', 'link_flair_text', 'author', 'score', 'num_comments', 'index',  'title', 'selftext']]

In [16]:
# df.info()

In [17]:
df['subreddit'].value_counts()

nfl    99661
epl    99589
Name: subreddit, dtype: int64

As shown by the value counts above, I'm starting out with about 100,000 posts for each subreddit. I originally started with fewer, but after I began cleaning, I was quickly running out of posts that met the conditions I wanted. I returned to my Data Collection notebook and increased the number of posts to request from the API so that I'd begin my cleaning with a much greater volume of posts than I would eventually need. That way, I could be more decisive in dropping rows rather than trying to salvage content from posts that had incomplete or low-quality information.

# **Basic Cleaning**

### Nulls

Nulls only exist in two columns: `link_flair_text` and `selftext`. 

I knew I had enough data that I could drop all of the posts with empty `selftext` fields, but I didn't want to lose the posts without tags (there are many). So I put 'none' into the empty `link_flair_text` fields and removed all rows with nulls after that, which left roughly 82,000 to 86,000 posts per subreddit.
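The fill-then-drop order matters, and a minimal sketch on made-up data shows why. The `toy` frame and its values are assumptions for illustration, not rows from the dataset.

```python
import pandas as pd

# Toy data standing in for the scraped posts (values are made up)
toy = pd.DataFrame({
    'subreddit': ['nfl', 'nfl', 'epl', 'epl'],
    'link_flair_text': [None, 'Serious', None, 'Question'],
    'selftext': ['some text', None, 'more text', None],
})

# Fill missing flair first so untagged rows survive the dropna that follows
toy['link_flair_text'] = toy['link_flair_text'].fillna('none')

# After the fill, only rows with a missing selftext still contain nulls
toy_clean = toy.dropna()
print(toy_clean)
```

Running `dropna` before the fill would instead discard every untagged post, which is most of the data here.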

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199250 entries, 0 to 199249
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   subreddit        199250 non-null  object
 1   created_utc      199250 non-null  int64 
 2   date             199250 non-null  object
 3   time             199250 non-null  object
 4   link_flair_text  29684 non-null   object
 5   author           199250 non-null  object
 6   score            199250 non-null  int64 
 7   num_comments     199250 non-null  int64 
 8   index            199250 non-null  int64 
 9   title            199250 non-null  object
 10  selftext         168312 non-null  object
dtypes: int64(4), object(7)
memory usage: 16.7+ MB


In [19]:
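# Assign the result back; an inplace fill on a selected column is deprecated in newer pandas
df['link_flair_text'] = df['link_flair_text'].fillna('none')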

In [20]:
# df.info()

In [21]:
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 168312 entries, 0 to 199248
Data columns (total 11 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   subreddit        168312 non-null  object
 1   created_utc      168312 non-null  int64 
 2   date             168312 non-null  object
 3   time             168312 non-null  object
 4   link_flair_text  168312 non-null  object
 5   author           168312 non-null  object
 6   score            168312 non-null  int64 
 7   num_comments     168312 non-null  int64 
 8   index            168312 non-null  int64 
 9   title            168312 non-null  object
 10  selftext         168312 non-null  object
dtypes: int64(4), object(7)
memory usage: 15.4+ MB


In [22]:
df['subreddit'].value_counts()

epl    86351
nfl    81961
Name: subreddit, dtype: int64

### Simple Duplicates

Dropping duplicates brings down PremierLeague posts to a good range, but the number of NFL posts is still much greater than necessary. As I move forward, I'll work on bringing down the number of NFL posts to at least roughly match that of the PremierLeague posts.
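A minimal sketch of what the two `drop_duplicates` calls do, using made-up rows (the `toy` frame is an assumption for illustration). With `subset=['title']`, pandas compares titles across the whole frame, regardless of subreddit or author, and keeps the first occurrence by default. That is likely why recurring thread titles collapse so sharply.

```python
import pandas as pd

# Made-up posts: two share a title, two share a body
toy = pd.DataFrame({
    'subreddit': ['epl', 'epl', 'nfl', 'nfl'],
    'title': ['Daily Discussion', 'Daily Discussion', 'Draft thoughts', 'Trade rumors'],
    'selftext': ['Talk here', 'Talk here', 'Same body', 'Same body'],
})

# First pass: keep the first row for each distinct title (keep='first' is the default)
by_title = toy.drop_duplicates(subset=['title'])

# Second pass: among the survivors, keep the first row for each distinct body
by_both = by_title.drop_duplicates(subset=['selftext'])

print(by_title.index.tolist(), by_both.index.tolist())
```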

In [23]:
df.drop_duplicates(subset=['title'], inplace=True)
df.shape

(59730, 11)

In [24]:
df['subreddit'].value_counts()

nfl    53173
epl     6557
Name: subreddit, dtype: int64

In [25]:
df.drop_duplicates(subset=['selftext'], inplace=True)
df.shape

(56178, 11)

In [26]:
df['subreddit'].value_counts()

nfl    49832
epl     6346
Name: subreddit, dtype: int64

### Posts with deleted body text (selftext)

In [27]:
df.drop(axis=0, 
        labels=df[df['selftext'].str.startswith('[deleted]')].index, # Submissions with deleted selftext
        inplace=True)

df['subreddit'].value_counts()

nfl    49831
epl     6324
Name: subreddit, dtype: int64

### Posts with Markdown tables

In [28]:
# Raw string so the escaped pipe reaches the regex engine unchanged
markdowns = df[df['selftext'].str.contains(r'\|')]

In [29]:
markdowns['subreddit'].value_counts()

nfl    3906
epl     201
Name: subreddit, dtype: int64

In [30]:
df.drop(axis=0, labels=markdowns.index, inplace=True)
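The filter above drops every post that contains a pipe character, which also catches posts that use `|` outside of a table. As a stricter alternative (a sketch, not the approach used in this notebook), one could look for a Markdown table delimiter row such as `:--|:--`. The pattern and the `has_markdown_table` helper below are my own assumptions and have not been run against the real data.

```python
import re

# A delimiter row: dash groups (optionally with colons) separated by at least one pipe
TABLE_DELIMITER = re.compile(
    r'^[ \t]*\|?[ \t]*:?-+:?[ \t]*(?:\|[ \t]*:?-+:?[ \t]*)+\|?[ \t]*\r?$',
    re.MULTILINE,
)

def has_markdown_table(text):
    """Return True if any line of the text looks like a table delimiter row."""
    return bool(TABLE_DELIMITER.search(text))

# Made-up examples
table_post = 'Team|Wins\n:--|:--\nJets|2'
pipe_only_post = 'Final score | Patriots 21 - 14 Jets'

print(has_markdown_table(table_post), has_markdown_table(pipe_only_post))
```

With a pandas column, this could be applied as `df['selftext'].map(has_markdown_table)`.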

### Remove URLs

In [31]:
# df['selftext'] = df['selftext'].replace('http\S+', '', regex=True).replace('www\S+', '', regex=True)
# df['title'] = df['title'].replace('http\S+', '', regex=True).replace('www\S+', '', regex=True)

# https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python/40823105#40823105

In [32]:
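def remove_url(text):
    """Strip anything that looks like scheme://host/path from the text."""
    # `re` is already imported at the top of the notebook
    return re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', text)

# Gwen's string didn't work for me:
# r'^https?:\/\/.*[\r\n]*'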

sentence = 'I get help from https://stackoverflow.com and I learn a lot reading on https://towardsdatascience.com'

remove_url(sentence)

'I get help from  and I learn a lot reading on '
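Two edge cases of this pattern are worth knowing about. The sketch below reuses the same regular expression on made-up strings: inside a Markdown link the pattern also consumes the closing parenthesis, and a bare `www.` address with no scheme is left alone.

```python
import re

# Same pattern as remove_url above, compiled once
URL_PATTERN = re.compile(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*')

def remove_url(text):
    return URL_PATTERN.sub('', text)

# Markdown link: the URL and the trailing ')' are removed, the '[...](' part stays
markdown_link = 'Vote now: [View Poll](https://www.reddit.com/poll/g437k5) thanks'
leftover = remove_url(markdown_link)

# No scheme (no '://'), so nothing matches
bare = remove_url('see www.example.com')

print(repr(leftover))
print(repr(bare))
```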

In [33]:
df['selftext'] = df['selftext'].map(remove_url)

In [34]:
# df['selftext'][:5]

# **Deep Cleaning**

## Repeated Post Titles

### Overview

Before I got the duplicate-removal code above working, I manually went through a list of the most commonly repeated post `'selftext'` values and cleared them out with a list and a for-loop. I'm going to remove the bulk of that code since it's no longer necessary, but some of it is still relevant, even after removing duplicates with the code above.

In [35]:
# df['selftext'].value_counts()[:30]

In [36]:
# df.head()

After looking at the value counts for the 'selftext' field, I realized that there were a lot of posts that had the exact same body text. So I used the following code to look at the top 20 most common titles and there were quite a few that were re-used many times. 

In [37]:
# df['title'].value_counts()[:20]

Readable results from above code:

| Post Title                                         | Count | ┃ | Post Title                          | Count | ┃ | Post Title                                   | Count |
|:---------------------------------------------------|:------|:-:|:------------------------------------|:------|:-:|:---------------------------------------------|:------|
| Shitpost Saturday                                  | 174   | ┃ | Talko Tuesday                       | 124   | ┃ | r/PremierLeague Midweek Musings              | 13    |
| Water Cooler Wednesday                             | 159   | ┃ | r/PremierLeague Daily Discussion    | 71    | ┃ | Weekly /r/PremierLeague Subreddit Suggestion | 11    |
| Free Talk Friday                                   | 158   | ┃ | This Week's Top /r/NFL [Highlight]s | 22    | ┃ | Test                                         | 11    |
| Sunday Brunch                                      | 157   | ┃ | Weekend Wrap Up                     | 21    | ┃ | Daily Open Discussion Thread                 | 11    |
| Thursday Talk Thread... Yes That's The Thread Name | 141   | ┃ | Question                            | 15    | ┃ | Weekly Transfer Discussion Thread            | 8     |
| Weekend Wrapup                                     | 130   | ┃ | NFL Power Rankings (Combined)       | 15    | ┃ | test                                         | 7     |

They fell into several categories:
1. Open threads meant for discussion of topics of any kind, even if unrelated to the topic of the subreddit.
2. Discussion threads in which the topics might be related, but all of the content is in the comments rather than the body of the post
3. Posts with code and/or little-to-no useful content
4. Commonly used titles by different users to introduce a topic-relevant post

### Removing the posts

In [38]:
repeat_titles = ["Shitpost Saturday", "Water Cooler Wednesday", "Free Talk Friday", "Sunday Brunch", 
                 "Thursday Talk Thread... Yes That's The Thread Name", "Weekend Wrapup", "Talko Tuesday",
                 "r/PremierLeague Daily Discussion", "This Week's Top /r/NFL [Highlight]s", 
                 "Weekend Wrap Up", "NFL Power Rankings (Combined)", "r/PremierLeague Midweek Musings", 
                 "Whose Line is it Anyways Wednesday--Offseason Edition", 
                 "Weekly /r/PremierLeague Subreddit Suggestion", "Test", "Daily Open Discussion Thread",
                 "Weekly Transfer Discussion Thread", "test", "Your Weekly /r/nfl Recap", 
                 "NFL Power Rankings (Combined) Week 0",
                 "Should Ole stay at Manchester United or not? If he got sacked by the board, who will be the best replacement. Comment your thoughts below"
                ]

In [39]:
for title in repeat_titles:
    title_df = df[df['title'] == title]  
    df.drop(axis=0, labels=title_df.index, inplace=True)
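The loop above can also be written as a single boolean filter. This is a sketch on made-up rows; `toy` and `repeat_titles_demo` are stand-ins for `df` and `repeat_titles`.

```python
import pandas as pd

# Made-up posts
toy = pd.DataFrame({
    'title': ['Free Talk Friday', 'Why my team will win', 'test', 'Sunday Brunch'],
    'selftext': ['...', 'A real post', 'asdf', '...'],
})
repeat_titles_demo = ['Free Talk Friday', 'Sunday Brunch', 'test']

# isin marks rows whose title exactly matches any list entry; ~ inverts the mask
kept = toy[~toy['title'].isin(repeat_titles_demo)]
print(kept)
```

Like the loop, `isin` matches whole titles exactly and is case-sensitive, which is why both "Test" and "test" appear in the list.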

In [40]:
# df['title'].value_counts()[:20]

In [41]:
df['subreddit'].value_counts()

nfl    45921
epl     6117
Name: subreddit, dtype: int64

## PremierLeague Poll Posts

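I found 91 poll posts from the PremierLeague subreddit that contained a lot of unnecessary information, with formatting that would make vectorizing significantly more complicated, so I decided to simply remove them. In this run, though, the prefix filter below matches no rows (`poll_posts.shape` is `(0, 11)`), so the drop has no effect here. Posts flaired as 'Poll' are removed by flair further down instead.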

In [42]:
len('[View Poll](https://www.reddit.com/poll/g437k5)')

47

In [43]:
poll_posts = df[df['selftext'].str.startswith('  [View Poll]')]

In [44]:
poll_posts.shape

(0, 11)
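The filter above matched no rows in this run. One possible explanation, not checked against the data, is that the bodies begin with different leading whitespace than the two spaces the prefix expects. The sketch below compares the strict prefix test with a whitespace-tolerant one on made-up strings; `is_poll_post` is a hypothetical helper.

```python
def is_poll_post(selftext):
    """Return True if the body begins with a poll link, ignoring leading whitespace."""
    return selftext.lstrip().startswith('[View Poll]')

# Made-up bodies with different leading whitespace
bodies = [
    '[View Poll](',
    '\n\n[View Poll](',
    '  [View Poll](',
    'Who wins the league? [View Poll](',
]

# The strict test only matches exactly two leading spaces
strict = [b.startswith('  [View Poll]') for b in bodies]
lenient = [is_poll_post(b) for b in bodies]

print(strict)
print(lenient)
```

On a DataFrame the same idea would be `df['selftext'].str.lstrip().str.startswith('[View Poll]')`.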

In [45]:
df.drop(axis=0, labels=poll_posts.index, inplace=True)

In [46]:
df['subreddit'].value_counts()

nfl    45921
epl     6117
Name: subreddit, dtype: int64

## PremierLeague Match Threads

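Match threads raised the same problem as the poll posts: a lot of unnecessary information and formatting that would complicate vectorizing. The title filter below flags 9 of them. Note that I only identified these posts here; no `df.drop` call follows, so they remain in the DataFrame, as the unchanged subreddit counts at the end of this section show.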

In [47]:
match_thread_titles = ('[Match Thread]', 
                       '[Match thread]', 
                       '[match Thread]', 
                       '[match thread]')

In [48]:
df['title'].str.startswith(match_thread_titles).value_counts()

False    52029
True         9
Name: title, dtype: int64

In [50]:
match_threads = df[df['title'].str.startswith(match_thread_titles)]

In [51]:
match_threads.shape

(9, 11)

In [52]:
df['subreddit'].value_counts()

nfl    45921
epl     6117
Name: subreddit, dtype: int64

## Removing NFL posts with Tags

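Here I dropped every NFL post that carries a flair tag (Game Thread, Serious, Look Here!, and others), then dropped the remaining posts flaired as 'Poll'.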

In [53]:
df_nfl['link_flair_text'].value_counts()[:10]

Look Here!                        1805
Game Thread                       1061
Serious                            748
Free Talk                          492
Free talk                          437
Removed: Rule 2 - Invalid Post     116
Post Game Thread                    72
Trash Talk                          60
Look Here                           51
game                                42
Name: link_flair_text, dtype: int64

In [54]:
nfl_with_tags = df[(df['link_flair_text'] != 'none') & (df['subreddit'] == 'nfl')]
df.drop(axis=0, labels=nfl_with_tags.index, inplace=True)

In [55]:
df['link_flair_text'].value_counts()[:10]

none                       47228
Discussion                  1583
Question                     919
Poll                         564
:xpl: Premier League         246
News                          79
:mun: Manchester United       72
:liv: Liverpool               62
:ars: Arsenal                 57
:che: Chelsea                 53
Name: link_flair_text, dtype: int64

In [56]:
df.drop(axis=0, labels=df[(df['link_flair_text'] == 'Poll')].index, inplace=True)

df['subreddit'].value_counts()

nfl    45124
epl     5553
Name: subreddit, dtype: int64

## NFL Posts Filtered by String Length

To reduce the number of posts I had from r/nfl, I decided to filter based on length.

First, I created a DataFrame that contained only nfl posts.

In [57]:
df_filtered = df[df['subreddit'] == 'nfl']

df_filtered.shape

(45124, 11)

I then created a second DataFrame that contained only the rows I wanted to keep: NFL posts longer than 500 and shorter than 1,200 characters. I landed on those limits after several tests, once they brought the count of NFL posts down to a number similar to that of the EPL posts.

Taking this slightly roundabout way allowed me to see the number of posts I'd have remaining once I removed the excess from the main DataFrame.

In [58]:
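# Compute the lengths once; no line-continuation backslashes are needed inside brackets
selftext_len = df['selftext'].str.len()
df_lengthlimits = df[(selftext_len > 500) &
                     (selftext_len < 1200) &
                     (df['subreddit'] == 'nfl')]
df_lengthlimits.shape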

(6981, 11)
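Both limits are strict, so posts of exactly 500 or 1,200 characters are excluded. A small sketch on made-up strings makes the boundaries visible. It uses `Series.between`, which is inclusive at both ends by default, so 501 to 1199 is equivalent for integer lengths.

```python
import pandas as pd

# Made-up posts sitting on and around the limits
toy = pd.DataFrame({
    'subreddit': ['nfl', 'nfl', 'nfl', 'epl'],
    'selftext': ['a' * 500, 'b' * 501, 'c' * 1200, 'd' * 800],
})

lengths = toy['selftext'].str.len()

# Equivalent to (lengths > 500) & (lengths < 1200) for whole-number lengths
keep_nfl = toy[(toy['subreddit'] == 'nfl') & lengths.between(501, 1199)]
print(keep_nfl.index.tolist())
```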

Using the index of that second DataFrame, I removed all the posts I want to keep from the DataFrame I created above: 'df_filtered', thus giving me a DataFrame containing all of the posts I want to exclude. 

In [59]:
# A non-inplace drop returns a new DataFrame, which avoids the SettingWithCopyWarning
# that an inplace drop on a slice of `df` raises
df_filtered = df_filtered.drop(index=df_lengthlimits.index)

In [60]:
df_filtered.shape

(38143, 11)

With that filtered DataFrame, I was able to use its index to drop all of the unwanted posts from the primary DataFrame.

In [61]:
df.drop(axis=0, labels=df_filtered.index, inplace=True)

In [62]:
df['subreddit'].value_counts()

nfl    6981
epl    5553
Name: subreddit, dtype: int64

Later on, I realized that there were some extremely long posts (upwards of 25,000-30,000 characters) in the epl subreddit that really skewed my distributions in the EDA section. I came back to remove those outliers and then moved forward again from here.

The content is valuable, though, so I didn't want to trim too much. Rather than trimming to a maximum of 1,200 characters as I did for the NFL posts, I cut it off at 3,000. The distribution will still be off, but not nearly as severely as before.

In [63]:
df[(df['subreddit'] == 'epl') & (df['selftext'].str.len()>3000)].shape

(118, 11)

In [64]:
df = df[df['selftext'].str.len()<3000]

In [65]:
df.shape

(12416, 11)

In [66]:
df['subreddit'].value_counts()

nfl    6981
epl    5435
Name: subreddit, dtype: int64

# **Re-Indexing**

After removing so many rows, the DataFrame's index had gaps in its sequencing, so I decided to reset it to clean it up.

In [67]:
pd.reset_option('display.max_colwidth')

In [68]:
df[5:9]

|    | subreddit | created_utc | date | time | link_flair_text | author | score | num_comments | index | title | selftext |
|:---|:----------|:------------|:-----|:-----|:----------------|:-------|:------|:-------------|:------|:------|:---------|
| 7  | epl | 1619271770 | 2021-04-24 | 06:42:50 | Discussion | CC-33 | 1 | 3 | 7 | My thoughts and prayers are with Jurgen Klopp ... | Imagine being Jurgen Klopp right now, arguably... |
| 9  | epl | 1619278607 | 2021-04-24 | 08:36:47 | Discussion | Cheerful_Jerry9603 | 1 | 2 | 9 | Premier League Players who should finish off t... | Chinese Super League is known to be the last p... |
| 14 | epl | 1619283368 | 2021-04-24 | 09:56:08 | Question | alphaftw1 | 1 | 8 | 14 | Hypothetical situation, what happens if both c... | So let’s say this season, arsenal win the euro... |
| 15 | epl | 1619283692 | 2021-04-24 | 10:01:32 | Question | imjonathvn | 1 | 24 | 15 | Norwich, Watford, and Bournemouth might all ge... | Norwich and Watford have already been promoted... |


The 'index' column can serve as a record of each post's original index number in case it's ever needed going forward.

In [69]:
df.reset_index(drop=True, inplace=True)

In [70]:
df[5:9]

|    | subreddit | created_utc | date | time | link_flair_text | author | score | num_comments | index | title | selftext |
|:---|:----------|:------------|:-----|:-----|:----------------|:-------|:------|:-------------|:------|:------|:---------|
| 5  | epl | 1619271770 | 2021-04-24 | 06:42:50 | Discussion | CC-33 | 1 | 3 | 7 | My thoughts and prayers are with Jurgen Klopp ... | Imagine being Jurgen Klopp right now, arguably... |
| 6  | epl | 1619278607 | 2021-04-24 | 08:36:47 | Discussion | Cheerful_Jerry9603 | 1 | 2 | 9 | Premier League Players who should finish off t... | Chinese Super League is known to be the last p... |
| 7  | epl | 1619283368 | 2021-04-24 | 09:56:08 | Question | alphaftw1 | 1 | 8 | 14 | Hypothetical situation, what happens if both c... | So let’s say this season, arsenal win the euro... |
| 8  | epl | 1619283692 | 2021-04-24 | 10:01:32 | Question | imjonathvn | 1 | 24 | 15 | Norwich, Watford, and Bournemouth might all ge... | Norwich and Watford have already been promoted... |


# **Column Clean-up**

In [71]:
pd.reset_option('display.max_colwidth')

## Renaming 'selftext' to 'post'

In [72]:
df['post'] = df['selftext']

In [73]:
df.drop(columns='selftext', inplace=True)

## Merging 'title' and 'post' into an 'alltext' column

In [74]:
df['alltext'] = df['title'] + ' ' + df['post']

In [75]:
pd.set_option('display.max_colwidth', None)

In [76]:
# checks
pd.DataFrame(df.iloc[343]).T[['title', 'post', 'alltext']]

|     | title | post | alltext |
|:----|:------|:-----|:--------|
| 343 | If anyone thought Chris Wilder wasn’t doing a good job. | If anyone thought that Chris Wilder wasn’t doing a good job then you have to look no further than today’s game. Leicester played brilliantly of course, but we were abysmal. Since we were promoted to the premier league a couple of seasons ago we haven’t lost a game by more than a 3 goal margin. The first game Wilder isn’t in charge and we let 5 in. This was one of the best defences in the league last season. And to add insult to injury, we had 1 shot in the whole damn game. Of course we haven’t been prolific all season and haven’t scored enough goals. But one damn shot the whole game is pathetic. To me this just shows how good a manager Wilder is, he took league 1 players and made them perform way above themselves. Without him we will absolutely capitulate. | If anyone thought Chris Wilder wasn’t doing a good job. If anyone thought that Chris Wilder wasn’t doing a good job then you have to look no further than today’s game. Leicester played brilliantly of course, but we were abysmal. Since we were promoted to the premier league a couple of seasons ago we haven’t lost a game by more than a 3 goal margin. The first game Wilder isn’t in charge and we let 5 in. This was one of the best defences in the league last season. And to add insult to injury, we had 1 shot in the whole damn game. Of course we haven’t been prolific all season and haven’t scored enough goals. But one damn shot the whole game is pathetic. To me this just shows how good a manager Wilder is, he took league 1 players and made them perform way above themselves. Without him we will absolutely capitulate. |


In [77]:
# Compare the combined length of title and post with the length of alltext;
# the one-character difference is the joining space
row = df.iloc[343]
len(row['title']) + len(row['post']), len(row['alltext'])


(821, 822)
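The one-character gap in the output above is the space used to join the two columns. A tiny check on made-up strings (not rows from the dataset) shows the arithmetic:

```python
# Made-up title and body
title = 'Example title'
post = 'Example body text.'

# Same join as the alltext column: title, one space, then the post
alltext = title + ' ' + post

# The joined string is exactly one character longer than the two parts combined
extra = len(alltext) - (len(title) + len(post))
print(extra)
```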

## Renaming 'num_comments' to 'comments'

Shortening column name

In [78]:
df['comments'] = df['num_comments']

df.drop(columns='num_comments', inplace=True)

## Renaming 'link_flair_text' to 'tag'

In [79]:
df['tag'] = df['link_flair_text']

df.drop(columns='link_flair_text', inplace=True)

## Creating 'target' column

Here I created a column that represents each row's subreddit as a 1 (nfl) or 0 (epl), which will allow our models to easily recognize and interact with the data.

In [80]:
df['target'] = df['subreddit'].map(lambda x: 1 if x == 'nfl' else 0)

## Re-ordering columns

Old Order:

'subreddit',  
'created_utc', 'date', 'time',  
'link_flair_text', 'author', 'score', 'num_comments',  
'index',  'title', 'selftext'


In [81]:
df = df[[
        'subreddit', 
         'target', 
         'author', 
         'score', 
         'comments', 
         'tag', 
         'index',
         'created_utc', 
         'date', 
         'time',
         'title', 
         'post', 
         'alltext'
        ]]

In [82]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12416 entries, 0 to 12415
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   subreddit    12416 non-null  object
 1   target       12416 non-null  int64 
 2   author       12416 non-null  object
 3   score        12416 non-null  int64 
 4   comments     12416 non-null  int64 
 5   tag          12416 non-null  object
 6   index        12416 non-null  int64 
 7   created_utc  12416 non-null  int64 
 8   date         12416 non-null  object
 9   time         12416 non-null  object
 10  title        12416 non-null  object
 11  post         12416 non-null  object
 12  alltext      12416 non-null  object
dtypes: int64(5), object(8)
memory usage: 1.2+ MB


In [83]:
df['subreddit'].value_counts()

nfl    6981
epl    5435
Name: subreddit, dtype: int64

In [84]:
df['subreddit'].value_counts(normalize=True)

nfl    0.562258
epl    0.437742
Name: subreddit, dtype: float64

In [85]:
pd.reset_option('display.max_colwidth')
df[1974:1980]

|      | subreddit | target | author | score | comments | tag | index | created_utc | date | time | title | post | alltext |
|:-----|:----------|:-------|:-------|:------|:---------|:----|:------|:------------|:-----|:-----|:------|:-----|:--------|
| 1974 | epl | 0 | Year-Representative | 1 | 11 | Question | 10267 | 1601401634 | 2020-09-29 | 10:47:14 | Someone please explain me the Handball rule | What constitutes a handball? What is considere... | Someone please explain me the Handball rule Wh... |
| 1975 | epl | 0 | maseone2nine | 1 | 2 | Discussion | 10270 | 1601405960 | 2020-09-29 | 11:59:20 | Spurs crowd noise is ATROCIOUS | Every Spurs home game they absolutely BLAST th... | Spurs crowd noise is ATROCIOUS Every Spurs hom... |
| 1976 | epl | 0 | roots235 | 1 | 14 | Discussion | 10272 | 1601413758 | 2020-09-29 | 14:09:18 | Better epl duo? Ronaldo and rooney vs mo and s... | For me, rooney and ronaldo had better link up ... | Better epl duo? Ronaldo and rooney vs mo and s... |
| 1977 | epl | 0 | Wingz12_ | 1 | 0 | none | 10273 | 1601414361 | 2020-09-29 | 14:19:21 | Are we going to waste this generation at Chelsea? | (Commenting this here because the lads at r/ch... | Are we going to waste this generation at Chels... |
| 1978 | epl | 0 | Messe_Lingard | 1 | 5 | Question | 10274 | 1601414505 | 2020-09-29 | 14:21:45 | Is \$25 a good price for a New Jersey from an I... | Would you pay \$25 from an Insta seller? I’ve b... | Is \$25 a good price for a New Jersey from an I... |
| 1979 | epl | 0 | TeddyMMR | 1 | 7 | :xpl: Premier League | 10276 | 1601417535 | 2020-09-29 | 15:12:15 | People are giving Liverpool the league already | but am I missing something? \n\n* An unconvinc... | People are giving Liverpool the league already... |


---

# **Final Cleaning**

Once columns were cleaned up and the `'alltext'` column was made (from `'title'` and `'selftext'`), I did one more pass over the new column to remove specific strings of text.

## Subreddit name strings

The names of the subreddits included in titles and body text are likely to be obvious tells for classification, which will be great for our model - helping to ensure high accuracy classification of posts for OverArmor. 

For EDA, however, removing them might be better, as it will give us a cleaner look at the common vernacular of each community.

In [86]:
titlecount_alltext_nfl = df[df['alltext'].str.contains('r/nfl')].shape[0] + df[df['alltext'].str.contains('r/NFL')].shape[0]
titlecount_alltext_epl = df[df['alltext'].str.contains('r/premierleague')].shape[0] + df[df['alltext'].str.contains('r/PremierLeague')].shape[0]

In [87]:
print(f"Count of 'r/nfl' in 'alltext' column: {titlecount_alltext_nfl}")
print(f"Count of 'r/PremierLeague' in 'alltext' column: {titlecount_alltext_epl}")

Count of 'r/nfl' in 'alltext' column: 211
Count of 'r/PremierLeague' in 'alltext' column: 28
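The counts above add the matches for two spellings, so a post that contains both is counted twice, and other casings are missed. A possible alternative, sketched on made-up strings, is a single case-insensitive search per post. `NFL_MENTION` and the sample texts are assumptions for illustration.

```python
import re

# One case-insensitive pattern instead of one test per spelling
NFL_MENTION = re.compile(r'r/nfl', re.IGNORECASE)

# Made-up posts
texts = [
    'Cross-posted from r/nfl and r/NFL',
    'r/nfl mods please read',
    'No mention here',
]

# Adding the two spellings separately counts the first post twice
naive = sum('r/nfl' in t for t in texts) + sum('r/NFL' in t for t in texts)

# One search per post counts each post at most once
single_pass = sum(bool(NFL_MENTION.search(t)) for t in texts)

print(naive, single_pass)
```

On the `alltext` column, the pandas equivalent would be `df['alltext'].str.contains('r/nfl', case=False).sum()`.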


In [88]:
# we can remove these for EDA later
# nfl_titles, epl_titles, blanks = ('r/nfl', 'r/NFL'), ('r/premierleague', 'r/PremierLeague'), ('','')
# df.replace(nfl_titles, blanks, inplace=True)
# df.replace(epl_titles, blanks, inplace=True)

In [89]:
# df.replace(nfl_titles, blanks, inplace=True)

In [90]:
# df.replace(epl_titles, blanks, inplace=True)

In [91]:
# df['alltext'][100:120]

## Punctuation

For simplicity, I replaced the code for various symbols with spaces. This will help me get at clean tokens when I get to the point of tokenizing and vectorizing for analysis.  

My first attempts to replace the shortcodes below were unsuccessful (including a loop that would have made the process more efficient), so I searched for and found a good solution using some basic RegEx and kept it repetitive and simple.

https://stackoverflow.com/questions/44227748/removing-newlines-from-messy-strings-in-pandas-dataframe-cells

In [92]:
# \r (return)
df.replace('\r', ' ', regex=True, inplace=True)

# \n (line break)
df.replace('\n', ' ', regex=True, inplace=True)

# \t (tab)
df.replace('\t', ' ', regex=True, inplace=True)

# &amp; (&)
df.replace('&amp;', ' ', regex=True, inplace=True)

# &nbsp; (space)
df.replace('&nbsp;', ' ', regex=True, inplace=True)

# nbsp; (space, chained to other code)
df.replace('nbsp;', ' ', regex=True, inplace=True)

# &gt; and &lt; (> and <)
df.replace('&lt;', '', regex=True, inplace=True)
df.replace('&gt;', '', regex=True, inplace=True)

# x200B (zero-width space code); needs regex=True to match inside longer strings, see note below
df.replace('x200B', '', regex=True, inplace=True)

# '  (apostrophe)
df.replace("'", '', regex=True, inplace=True)

# * and ** (asterisks)
df.replace(r'\*', '', regex=True, inplace=True)

# ‚Äì (mis-decoded character sequence)
df.replace('‚Äì', '', regex=True, inplace=True)

# [  (left bracket)
df.replace(r'\[', '', regex=True, inplace=True)

The string 'x200B' comes from the code for a zero-width space. My earlier attempts to remove it failed, most likely because they left out `regex=True`: without it, `DataFrame.replace` only replaces cells whose entire value equals the search string, so text embedded in a longer post is never touched. The same applied to the `‚Äì` and `[` replacements, so all three now pass `regex=True`. If the zero-width space is stored as the actual character rather than as the text 'x200B', the pattern would need to be '\u200b' instead. Changes in text encoding throughout the collection, ingestion, and cleaning process could also play a role, so more research is warranted here.
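A different technique from the replacements above, shown only as a sketch: Python's standard library can decode HTML entities with `html.unescape`, which turns `&amp;`, `&gt;`, `&nbsp;`, and `&#x200B;` into the characters they stand for. The `clean_entities` helper and the sample string are assumptions. Note that this version keeps the decoded `&` and `>` characters instead of deleting them as the cell above does.

```python
import html
import re

def clean_entities(text):
    """Decode HTML entities, drop zero-width spaces, and collapse whitespace."""
    # Unescape twice to handle double-escaped entities such as '&amp;#x200B;'
    decoded = html.unescape(html.unescape(text))
    # Remove the zero-width space character itself
    decoded = decoded.replace('\u200b', '')
    # Collapse runs of whitespace (\r, \n, \t, non-breaking space) into one space
    return re.sub(r'\s+', ' ', decoded).strip()

# Made-up body text with the kinds of codes listed above
sample = 'Line one\r\n\r\n&amp;#x200B;\n\nTom &amp; Jerry&nbsp;&gt; cats\t!'
cleaned = clean_entities(sample)
print(repr(cleaned))
```

It could be applied to a text column with `df['alltext'].map(clean_entities)`. The double unescape is a judgment call: it would also decode text that was meant to display a literal entity.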

# **Exports**

In [93]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12416 entries, 0 to 12415
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   subreddit    12416 non-null  object
 1   target       12416 non-null  int64 
 2   author       12416 non-null  object
 3   score        12416 non-null  int64 
 4   comments     12416 non-null  int64 
 5   tag          12416 non-null  object
 6   index        12416 non-null  int64 
 7   created_utc  12416 non-null  int64 
 8   date         12416 non-null  object
 9   time         12416 non-null  object
 10  title        12416 non-null  object
 11  post         12416 non-null  object
 12  alltext      12416 non-null  object
dtypes: int64(5), object(8)
memory usage: 1.2+ MB


## Clean Export for EDA and Modeling

In [94]:
df.to_csv('../data/2_clean/reddit_posts_clean.csv', index=False)

Having collected and cleaned the data, I have completed my deliverable for OverArmor's first request.

It remains to be seen how my models will do, but based on the way the data looks, I expect decent results. I think there are strong enough differences between the language used in these subreddits that the model will be able to do a good job. 

I expect team names, city names, and other words unique to each community to be the strongest signals.