<span style="font-family:Trebuchet MS; font-size:2em;">Project 3 | NB2: Cleaning and Preprocessing</span>

Riley Robertson | Reddit Classification Project

# Imports and setup

## Module Imports

I began by importing basic libraries, then returned to add modules as the cleaning required them. I also set display preferences, imported my data, and set up my main DataFrame so I could begin cleaning.

In [1]:
# basics
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re

# custom
import utilities.densmore as dns

# date and time 
import datetime as dt
import time

# for CVEC test
from sklearn.feature_extraction.text import CountVectorizer

## Module Preferences

In [2]:
pd.set_option('display.max_colwidth', None)

---

## Data Import and Setup

In [3]:
df_nfl = pd.read_csv('../data/raw/raw_nfl_v4.csv', low_memory=False)
df_epl = pd.read_csv('../data/raw/raw_epl_v4.csv', low_memory=False)

In [4]:
df_nfl.shape, df_epl.shape

((99661, 13), (99589, 13))

In [5]:
# df_nfl_full = pd.read_csv('../git_ignore/output/raw_nfl_v4_full.csv', low_memory=False)
# df_epl_full = pd.read_csv('../git_ignore/output/raw_epl_v4_full.csv', low_memory=False)

In [6]:
# df_nfl_full.shape, df_epl_full.shape

### Merging the DataFrames

In [7]:
df = pd.concat([df_epl, df_nfl], ignore_index=True)

In [8]:
# df_full = pd.concat([df_epl_full, df_nfl_full], ignore_index=True)

### Checks

In [9]:
# df.shape
# df.head()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199250 entries, 0 to 199249
Data columns (total 13 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   index            199250 non-null  int64 
 1   subreddit        199250 non-null  object
 2   created_utc      199250 non-null  int64 
 3   author           199250 non-null  object
 4   num_comments     199250 non-null  int64 
 5   score            199250 non-null  int64 
 6   is_self          199250 non-null  bool  
 7   link_flair_text  29684 non-null   object
 8   title            199250 non-null  object
 9   selftext         168312 non-null  object
 10  full_link        199250 non-null  object
 11  date             199250 non-null  object
 12  time             199250 non-null  object
dtypes: bool(1), int64(4), object(8)
memory usage: 18.4+ MB


### Detecting Text Encoding

**Using Chardetect terminal command**

In [10]:
# ! chardetect ../data/raw/raw_nfl_v4.csv  

**Using `with open()`**

In [11]:
# with open('../data/raw/raw_nfl_v4.csv') as f:
#     print(f)

### Renaming PremierLeague Subreddit to 'epl'

In [12]:
df['subreddit'].value_counts()

nfl              99661
PremierLeague    99589
Name: subreddit, dtype: int64

In [13]:
df['subreddit'] = df['subreddit'].map(lambda x: 'epl' if x == 'PremierLeague' else x)

In [14]:
df['subreddit'].value_counts()

nfl    99661
epl    99589
Name: subreddit, dtype: int64

### Column Sorting and Filtering

As I cleaned, I realized that I didn't ultimately need some of the columns I had initially included in my scraped data, so I filtered them out and reordered the remaining columns for ease of viewing.

In [15]:
df = df[['subreddit', 'created_utc', 'link_flair_text', 'author', 'score', 'num_comments', 'index',  'title', 'selftext']]

In [16]:
# df.info()

In [17]:
df['subreddit'].value_counts()

nfl    99661
epl    99589
Name: subreddit, dtype: int64

As shown by the value counts above, I'm starting out with about 100,000 posts for each subreddit. I originally started with fewer, but once I began cleaning, I was quickly running out of posts that met the conditions I wanted. I returned to my Data Collection notebook and increased the number of posts requested from the API so that I'd begin cleaning with a much greater volume of posts than I would eventually need. That way, I could be more decisive in dropping rows rather than trying to salvage content from posts with incomplete or low-quality information.

# Basic Cleaning

### Nulls

Nulls only exist in two columns: `link_flair_text` and `selftext`. 

I knew I had enough data that I could drop all of the posts with empty `selftext` fields, but I didn't want to lose the posts without tags (there are many). So I filled the empty `link_flair_text` fields with 'none' and then removed all remaining rows with nulls, which left roughly 86,000 PremierLeague and 82,000 NFL posts.

In [18]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 199250 entries, 0 to 199249
Data columns (total 9 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   subreddit        199250 non-null  object
 1   created_utc      199250 non-null  int64 
 2   link_flair_text  29684 non-null   object
 3   author           199250 non-null  object
 4   score            199250 non-null  int64 
 5   num_comments     199250 non-null  int64 
 6   index            199250 non-null  int64 
 7   title            199250 non-null  object
 8   selftext         168312 non-null  object
dtypes: int64(4), object(5)
memory usage: 13.7+ MB


In [19]:
df['link_flair_text'].fillna('none', inplace=True)

In [20]:
# df.info()

In [21]:
df.dropna(inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 168312 entries, 0 to 199248
Data columns (total 9 columns):
 #   Column           Non-Null Count   Dtype 
---  ------           --------------   ----- 
 0   subreddit        168312 non-null  object
 1   created_utc      168312 non-null  int64 
 2   link_flair_text  168312 non-null  object
 3   author           168312 non-null  object
 4   score            168312 non-null  int64 
 5   num_comments     168312 non-null  int64 
 6   index            168312 non-null  int64 
 7   title            168312 non-null  object
 8   selftext         168312 non-null  object
dtypes: int64(4), object(5)
memory usage: 12.8+ MB


In [22]:
df['subreddit'].value_counts()

epl    86351
nfl    81961
Name: subreddit, dtype: int64
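
The null handling above can also be written without `inplace=True`, which avoids modifying a column selection in place. This is a minimal sketch on a toy frame (the rows are made up for illustration, not taken from the data), and `dropna(subset=...)` is my assumption about how to make the intent explicit:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    'subreddit': ['nfl', 'epl', 'epl'],
    'link_flair_text': [np.nan, 'Discussion', np.nan],
    'selftext': ['some text', np.nan, 'more text'],
})

# Assign the filled column back instead of calling fillna(inplace=True).
toy['link_flair_text'] = toy['link_flair_text'].fillna('none')

# Drop only the rows whose body text is missing.
toy = toy.dropna(subset=['selftext'])

print(toy)
```

Naming the column in `subset` makes it clear that only a missing `selftext` causes a row to be dropped.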

### Simple Duplicates

Dropping duplicates brings down PremierLeague posts to a good range, but the number of NFL posts is still much greater than necessary. As I move forward, I'll work on bringing down the number of NFL posts to at least roughly match that of the PremierLeague posts.

In [23]:
df.drop_duplicates(subset=['title'], inplace=True)
df.shape

(59730, 9)

In [24]:
df['subreddit'].value_counts()

nfl    53173
epl     6557
Name: subreddit, dtype: int64

In [25]:
df.drop_duplicates(subset=['selftext'], inplace=True)
df.shape

(56178, 9)

In [26]:
df['subreddit'].value_counts()

nfl    49832
epl     6346
Name: subreddit, dtype: int64

### Posts with deleted body text (selftext)

In [27]:
df.drop(axis=0, 
        labels=df[df['selftext'].str.startswith('[deleted]')].index, # Submissions with deleted selftext
        inplace=True)

df['subreddit'].value_counts()

nfl    49831
epl     6324
Name: subreddit, dtype: int64

### Posts with Markdown tables

In [28]:
# raw string so the escaped pipe is passed to the regex engine unchanged
markdowns = df[df['selftext'].str.contains(r'\|')]

In [29]:
markdowns['subreddit'].value_counts()

nfl    3906
epl     201
Name: subreddit, dtype: int64

In [30]:
df.drop(axis=0, labels=markdowns.index, inplace=True)

### Remove URLs

In [31]:
# df['selftext'] = df['selftext'].replace('http\S+', '', regex=True).replace('www\S+', '', regex=True)
# df['title'] = df['title'].replace('http\S+', '', regex=True).replace('www\S+', '', regex=True)

# https://stackoverflow.com/questions/11331982/how-to-remove-any-url-within-a-string-in-python/40823105#40823105

In [32]:
def remove_url(text):
    # strip anything shaped like scheme://host/path (re is imported at the top)
    return re.sub(r'\w+:\/{2}[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*', '', text)

# Gwen's pattern didn't work for me:
# r'^https?:\/\/.*[\r\n]*'

sentence = 'I get help from https://stackoverflow.com and I learn a lot reading on https://towardsdatascience.com'

remove_url(sentence)

'I get help from  and I learn a lot reading on '

In [33]:
df['selftext'] = df['selftext'].map(lambda x: remove_url(x))

In [34]:
# df['selftext'][:5]

# Deep Cleaning

## Repeated Post Titles

### Overview

Before I got the duplicate-removal code above working, I manually went through a list of the most frequently repeated post `'selftext'` values and cleared them out with a list and a for-loop. I'm going to remove the bulk of that code since it's no longer necessary, but some of it is still relevant, even after removing duplicates with the code above.

In [35]:
# df['selftext'].value_counts()[:30]

In [36]:
# df.head()

After looking at the value counts for the 'selftext' field, I realized that there were a lot of posts that had the exact same body text. So I used the following code to look at the top 20 most common titles and there were quite a few that were re-used many times. 

In [37]:
# df['title'].value_counts()[:20]

Readable results from above code:

| Post Title                                         | Count | ┃ | Post Title                          | Count | ┃ | Post Title                                   | Count |
|:---------------------------------------------------|:------|:-:|:------------------------------------|:------|:-:|:---------------------------------------------|:------|
| Shitpost Saturday                                  | 174   | ┃ | Talko Tuesday                       | 124   | ┃ | r/PremierLeague Midweek Musings              | 13    |
| Water Cooler Wednesday                             | 159   | ┃ | r/PremierLeague Daily Discussion    | 71    | ┃ | Weekly /r/PremierLeague Subreddit Suggestion | 11    |
| Free Talk Friday                                   | 158   | ┃ | This Week's Top /r/NFL [Highlight]s | 22    | ┃ | Test                                         | 11    |
| Sunday Brunch                                      | 157   | ┃ | Weekend Wrap Up                     | 21    | ┃ | Daily Open Discussion Thread                 | 11    |
| Thursday Talk Thread... Yes That's The Thread Name | 141   | ┃ | Question                            | 15    | ┃ | Weekly Transfer Discussion Thread            | 8     |
| Weekend Wrapup                                     | 130   | ┃ | NFL Power Rankings (Combined)       | 15    | ┃ | test                                         | 7     |

They fell into several categories:
1. Open threads meant for discussion of topics of any kind, even if unrelated to the topic of the subreddit.
2. Discussion threads in which the topics might be related, but all of the content is in the comments rather than the body of the post
3. Posts with code and/or little-to-no useful content
4. Commonly used titles by different users to introduce a topic-relevant post

### Removing the posts

In [38]:
repeat_titles = ["Shitpost Saturday", "Water Cooler Wednesday", "Free Talk Friday", "Sunday Brunch", 
                 "Thursday Talk Thread... Yes That's The Thread Name", "Weekend Wrapup", "Talko Tuesday",
                 "r/PremierLeague Daily Discussion", "This Week's Top /r/NFL [Highlight]s", 
                 "Weekend Wrap Up", "NFL Power Rankings (Combined)", "r/PremierLeague Midweek Musings", 
                 "Whose Line is it Anyways Wednesday--Offseason Edition", 
                 "Weekly /r/PremierLeague Subreddit Suggestion", "Test", "Daily Open Discussion Thread",
                 "Weekly Transfer Discussion Thread", "test", "Your Weekly /r/nfl Recap", 
                 "NFL Power Rankings (Combined) Week 0",
                 "Should Ole stay at Manchester United or not? If he got sacked by the board, who will be the best replacement. Comment your thoughts below"
                ]

In [39]:
for title in repeat_titles:
    title_df = df[df['title'] == title]  
    df.drop(axis=0, labels=title_df.index, inplace=True)

In [40]:
# df['title'].value_counts()[:20]

In [41]:
df['subreddit'].value_counts()

nfl    45921
epl     6117
Name: subreddit, dtype: int64
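
The loop above drops one title at a time. An equivalent single-pass version can be sketched with `Series.isin`; the toy frame and the short `titles_to_drop` list below are illustrative stand-ins for `df` and `repeat_titles`:

```python
import pandas as pd

# Toy stand-in for the real DataFrame.
toy = pd.DataFrame({
    'title': ['Free Talk Friday', 'Who wins the title?', 'Test', 'test'],
    'subreddit': ['nfl', 'epl', 'epl', 'nfl'],
})

# Hypothetical shortened version of the repeat_titles list.
titles_to_drop = ['Free Talk Friday', 'Test', 'test']

# Keep every row whose title is NOT in the list (exact, case-sensitive match).
toy = toy[~toy['title'].isin(titles_to_drop)]

print(toy)
```

Like the loop, this matches titles exactly, so variants with extra text (for example a date suffix) would still need their own entries.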

## PremierLeague Poll Posts

I found 91 poll posts from the PremierLeague subreddit that contained a lot of unnecessary information, and their formatting would have made vectorizing significantly more complicated, so I decided to remove them. In the current run the filter below matches no rows, so the drop leaves the counts unchanged.

In [42]:
len('[View Poll](https://www.reddit.com/poll/g437k5)')

47

In [43]:
poll_posts = df[df['selftext'].str.startswith('  [View Poll]')]

In [44]:
poll_posts.shape

(0, 9)

In [45]:
df.drop(axis=0, labels=poll_posts.index, inplace=True)

In [46]:
df['subreddit'].value_counts()

nfl    45921
epl     6117
Name: subreddit, dtype: int64

## PremierLeague Match Threads

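Match threads in the PremierLeague subreddit have the same problem as the poll posts: a lot of unnecessary information and formatting that would complicate vectorizing. I identify them by title prefix, covering four capitalization variants. The filter finds 9 such posts. The cells below only select them; no drop is applied here, so the subreddit counts are unchanged.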

In [47]:
match_thread_titles = ('[Match Thread]', 
                       '[Match thread]', 
                       '[match Thread]', 
                       '[match thread]')

In [48]:
df['title'].str.startswith(match_thread_titles).value_counts()

False    52029
True         9
Name: title, dtype: int64

In [50]:
match_threads = df[df['title'].str.startswith(match_thread_titles)]

In [51]:
match_threads.shape

(9, 9)

In [52]:
df['subreddit'].value_counts()

nfl    45921
epl     6117
Name: subreddit, dtype: int64
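
If the match threads should actually be removed, one way to do it is sketched below. This is an assumption on my part, since the cells above stop at selecting them. Lowercasing the title first replaces the tuple of capitalization variants and also catches casings the tuple does not list (such as all caps). The toy titles are made up:

```python
import pandas as pd

# Toy titles standing in for df['title'].
toy = pd.DataFrame({'title': [
    '[Match Thread] Arsenal vs Chelsea',
    '[MATCH THREAD] Leeds vs Everton',
    'Thoughts on the match?',
]})

# Case-insensitive prefix check: lowercase first, then compare.
is_match_thread = toy['title'].str.lower().str.startswith('[match thread]')

# Keep everything that is not a match thread.
toy = toy[~is_match_thread].reset_index(drop=True)

print(toy)
```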

## Removing NFL posts with Tags

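NFL posts that carry a flair tag (Game Thread, Serious, Look Here!, and others) are removed here, followed by any remaining posts tagged `Poll`.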

In [53]:
df_nfl['link_flair_text'].value_counts()[:10]

Look Here!                        1805
Game Thread                       1061
Serious                            748
Free Talk                          492
Free talk                          437
Removed: Rule 2 - Invalid Post     116
Post Game Thread                    72
Trash Talk                          60
Look Here                           51
game                                42
Name: link_flair_text, dtype: int64

In [54]:
nfl_with_tags = df[(df['link_flair_text'] != 'none') & (df['subreddit'] == 'nfl')]
df.drop(axis=0, labels=nfl_with_tags.index, inplace=True)

In [55]:
df['link_flair_text'].value_counts()[:10]

none                       47228
Discussion                  1583
Question                     919
Poll                         564
:xpl: Premier League         246
News                          79
:mun: Manchester United       72
:liv: Liverpool               62
:ars: Arsenal                 57
:che: Chelsea                 53
Name: link_flair_text, dtype: int64

In [56]:
df.drop(axis=0, labels=df[(df['link_flair_text'] == 'Poll')].index, inplace=True)

df['subreddit'].value_counts()

nfl    45124
epl     5553
Name: subreddit, dtype: int64

## NFL Posts Filtered by String Length

In [57]:
df_filtered = df[df['subreddit'] == 'nfl']
df_filtered.shape

(45124, 9)

In [58]:
df_lengthlimits = df[(df['selftext'].str.len()>500) & \
                     (df['selftext'].str.len()<1200) & \
                     (df['subreddit'] == 'nfl')]
df_lengthlimits.shape

(6981, 9)

In [59]:
df_filtered.drop(axis=0, labels=df_lengthlimits.index, inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().drop(


In [60]:
df_filtered.shape

(38143, 9)

In [61]:
df.drop(axis=0, labels=df_filtered.index, inplace=True)

In [62]:
df['subreddit'].value_counts()

nfl    6981
epl    5553
Name: subreddit, dtype: int64
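
The same length filter can be expressed as a single boolean mask, which avoids dropping rows from a slice and the `SettingWithCopyWarning` that comes with it. A minimal sketch on toy data (the strings are placeholders; the 500 and 1200 limits are the ones used above):

```python
import pandas as pd

# Toy stand-in: three NFL posts of different lengths and one EPL post.
toy = pd.DataFrame({
    'subreddit': ['nfl', 'nfl', 'nfl', 'epl'],
    'selftext': ['a' * 100, 'b' * 800, 'c' * 1500, 'd' * 50],
})

body_len = toy['selftext'].str.len()

# Keep all non-NFL posts, plus NFL posts strictly between 500 and 1200 characters.
keep = (toy['subreddit'] != 'nfl') | ((body_len > 500) & (body_len < 1200))
toy = toy[keep]

print(toy['subreddit'].value_counts())
```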

# Re-Indexing

After removing so many rows, the DataFrame's index had gaps in its sequencing, so I decided to reset it to clean it up.

In [63]:
pd.reset_option('display.max_colwidth')

In [64]:
df[5:9]

|    | subreddit | created_utc | link_flair_text | author | score | num_comments | index | title | selftext |
|---:|:----------|:------------|:----------------|:-------|:------|:-------------|:------|:------|:---------|
| 7  | epl | 1619271770 | Discussion | CC-33 | 1 | 3 | 7 | My thoughts and prayers are with Jurgen Klopp ... | Imagine being Jurgen Klopp right now, arguably... |
| 9  | epl | 1619278607 | Discussion | Cheerful_Jerry9603 | 1 | 2 | 9 | Premier League Players who should finish off t... | Chinese Super League is known to be the last p... |
| 14 | epl | 1619283368 | Question | alphaftw1 | 1 | 8 | 14 | Hypothetical situation, what happens if both c... | So let’s say this season, arsenal win the euro... |
| 15 | epl | 1619283692 | Question | imjonathvn | 1 | 24 | 15 | Norwich, Watford, and Bournemouth might all ge... | Norwich and Watford have already been promoted... |


The 'index' column can serve as a record of each post's original index number in case it's ever needed going forward.

In [65]:
df.reset_index(drop=True, inplace=True)

In [66]:
df[5:9]

|    | subreddit | created_utc | link_flair_text | author | score | num_comments | index | title | selftext |
|---:|:----------|:------------|:----------------|:-------|:------|:-------------|:------|:------|:---------|
| 5  | epl | 1619271770 | Discussion | CC-33 | 1 | 3 | 7 | My thoughts and prayers are with Jurgen Klopp ... | Imagine being Jurgen Klopp right now, arguably... |
| 6  | epl | 1619278607 | Discussion | Cheerful_Jerry9603 | 1 | 2 | 9 | Premier League Players who should finish off t... | Chinese Super League is known to be the last p... |
| 7  | epl | 1619283368 | Question | alphaftw1 | 1 | 8 | 14 | Hypothetical situation, what happens if both c... | So let’s say this season, arsenal win the euro... |
| 8  | epl | 1619283692 | Question | imjonathvn | 1 | 24 | 15 | Norwich, Watford, and Bournemouth might all ge... | Norwich and Watford have already been promoted... |


# Column Clean-up

In [67]:
pd.reset_option('display.max_colwidth')

**Converting `'created_utc'` to `'datetime'`**

In [68]:
df['datetime'] = df['created_utc'].map(lambda x: dt.datetime.fromtimestamp(x))
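
One caveat with `dt.datetime.fromtimestamp` is that it converts to the local timezone of the machine running the notebook. A vectorized alternative that keeps the values in UTC can be sketched with `pd.to_datetime`; the two timestamps below are copied from the rows shown earlier, and the toy frame is a stand-in for `df`:

```python
import pandas as pd

# Two epoch-second values taken from the rows displayed above.
toy = pd.DataFrame({'created_utc': [1619271770, 1619278607]})

# unit='s' tells pandas the integers are seconds since the Unix epoch (UTC).
toy['datetime'] = pd.to_datetime(toy['created_utc'], unit='s')

print(toy)
```

The result is a timezone-naive `datetime64` column whose values are UTC wall-clock times, so it does not depend on where the notebook is run.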

**Renaming `'selftext'` to `'post'`**

In [69]:
df['post'] = df['selftext']

In [70]:
df.drop(columns='selftext', inplace=True)

**Merging `'title'` and `'post'` into an `'alltext'` column**

In [71]:
df['alltext'] = df['title'] + ' ' + df['post']

In [72]:
pd.set_option('display.max_colwidth', None)

In [73]:
# checks
pd.DataFrame(df.iloc[343]).T[['title', 'post', 'alltext']]

|     | title | post | alltext |
|----:|:------|:-----|:--------|
| 343 | Which player(s) currently at your club, if any, have the potential to go down as club legends? | When I say legends, I mean the likes of Moore for West Ham, Henry for Arsenal etc. and not cult heros (like Michu for Swansea). For us (the Hammers) Noble is basically a legend already, and I could definitely see Rice joining him dependent on how long he stays with us/if he achieves anything with us. | Which player(s) currently at your club, if any, have the potential to go down as club legends? When I say legends, I mean the likes of Moore for West Ham, Henry for Arsenal etc. and not cult heros (like Michu for Swansea). For us (the Hammers) Noble is basically a legend already, and I could definitely see Rice joining him dependent on how long he stays with us/if he achieves anything with us. |


In [74]:
len(pd.DataFrame(df.iloc[343]).T['title'][343]) + len(pd.DataFrame(df.iloc[343]).T['post'][343]), \
len(pd.DataFrame(df.iloc[343]).T['alltext'][343])


(395, 396)

**Renaming `'num_comments'` to `'comments'`**

Shortening column name

In [75]:
df['comments'] = df['num_comments']

df.drop(columns='num_comments', inplace=True)

**Renaming `'link_flair_text'` to `'tag'`**

In [76]:
df['tag'] = df['link_flair_text']

df.drop(columns='link_flair_text', inplace=True)
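
The copy-then-drop pattern used for these renames can also be written as one `rename` call. A minimal sketch on a toy frame with the same three column names (the single row is a placeholder):

```python
import pandas as pd

# Toy frame with the three original column names.
toy = pd.DataFrame({
    'selftext': ['body text'],
    'num_comments': [3],
    'link_flair_text': ['none'],
})

# Map old names to new names in one step.
toy = toy.rename(columns={
    'selftext': 'post',
    'num_comments': 'comments',
    'link_flair_text': 'tag',
})

print(toy.columns.tolist())
```

One difference from copy-then-drop is that `rename` keeps each column in its original position instead of moving it to the end, which does not matter here because the columns are re-ordered afterwards.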

**Creating `'target'` column**

In [77]:
df['target'] = df['subreddit'].map(lambda x: 0 if x == 'nfl' else 1)

**Re-ordering columns**

In [78]:
df = df[['subreddit', 'target', 'author', 'score', 'comments', 'tag', 'index', 'alltext']]

In [79]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12534 entries, 0 to 12533
Data columns (total 8 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   subreddit  12534 non-null  object
 1   target     12534 non-null  int64 
 2   author     12534 non-null  object
 3   score      12534 non-null  int64 
 4   comments   12534 non-null  int64 
 5   tag        12534 non-null  object
 6   index      12534 non-null  int64 
 7   alltext    12534 non-null  object
dtypes: int64(4), object(4)
memory usage: 783.5+ KB


In [80]:
df['subreddit'].value_counts()

nfl    6981
epl    5553
Name: subreddit, dtype: int64

In [81]:
df['subreddit'].value_counts(normalize=True)

nfl    0.556965
epl    0.443035
Name: subreddit, dtype: float64

In [82]:
pd.reset_option('display.max_colwidth')
df[1974:1980]

|      | subreddit | target | author | score | comments | tag | index | alltext |
|-----:|:----------|:-------|:-------|:------|:---------|:----|:------|:--------|
| 1974 | epl | 1 | shahs210 | 1 | 0 | News | 10237 | Free last man standing competition. Join if in... |
| 1975 | epl | 1 | imR0N | 1 | 5 | :liv: Liverpool | 10238 | Reasons behind signing Diogo Jota for 40 milli... |
| 1976 | epl | 1 | ___ratsalad | 1 | 0 | :xpl: Premier League | 10239 | [New Series] The Football Book Club podcast He... |
| 1977 | epl | 1 | SmithBurger | 1 | 1 | Question | 10240 | When are replays available on peacock? The pea... |
| 1978 | epl | 1 | trashcan_paradise | 1 | 33 | none | 10241 | A suggestion for Americans looking for an EPL ... |
| 1979 | epl | 1 | zorfog | 1 | 24 | :xpl: Premier League | 10242 | Just a casual reminder to #BoycottPeacock This... |


---

# Pre-processing

## Module Imports 

In [83]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_extraction import text 

## Subreddit name strings

The names of the subreddits included in titles and body text are likely to be obvious tells for classification, which is great for our model, since it helps ensure high-accuracy classification of posts for OverArmor.

For EDA, however, removing them might be better, as it will give us a cleaner look at the common vernacular of each community.

In [84]:
titlecount_alltext_nfl = df[df['alltext'].str.contains('r/nfl')].shape[0] + df[df['alltext'].str.contains('r/NFL')].shape[0]
titlecount_alltext_epl = df[df['alltext'].str.contains('r/premierleague')].shape[0] + df[df['alltext'].str.contains('r/PremierLeague')].shape[0]

In [85]:
print(f"Count of 'r/nfl' in 'alltext' column: {titlecount_alltext_nfl}")
print(f"Count of 'r/PremierLeague' in 'alltext' column: {titlecount_alltext_epl}")

Count of 'r/nfl' in 'alltext' column: 211
Count of 'r/PremierLeague' in 'alltext' column: 29
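
The counts above add two separate case-sensitive matches, so a post containing both `r/nfl` and `r/NFL` is counted twice. A case-insensitive single pass can be sketched as follows; the three toy strings are made up, and note that this version also matches other casings such as `r/Nfl`:

```python
import pandas as pd

# Toy stand-in for df['alltext'].
posts = pd.Series([
    'Posted in r/NFL and r/nfl',
    'See r/PremierLeague for more',
    'no mention here',
])

# regex=False treats the pattern as a literal string; case=False ignores casing.
nfl_rows = posts.str.contains('r/nfl', case=False, regex=False).sum()
epl_rows = posts.str.contains('r/premierleague', case=False, regex=False).sum()

print(f"Rows mentioning r/nfl: {nfl_rows}")
print(f"Rows mentioning r/PremierLeague: {epl_rows}")
```

Both versions count rows that contain the string at least once, not the total number of occurrences.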


In [86]:
# we can remove these for EDA later
# nfl_titles, epl_titles, blanks = ('r/nfl', 'r/NFL'), ('r/premierleague', 'r/PremierLeague'), ('','')
# df.replace(nfl_titles, blanks, inplace=True)
# df.replace(epl_titles, blanks, inplace=True)

In [87]:
# df.replace(nfl_titles, blanks, inplace=True)

In [88]:
# df.replace(epl_titles, blanks, inplace=True)

In [89]:
# df['alltext'][100:120]

## Punctuation

For simplicity, I'm replacing the code for various symbols with spaces. This will help me get at clean tokens when I get to the point of tokenizing and vectorizing for analysis.  

My first attempts to replace the shortcodes failed, so I went online and found a good solution using simple RegEx.

In [90]:
# https://stackoverflow.com/questions/44227748/removing-newlines-from-messy-strings-in-pandas-dataframe-cells

In [91]:
# \r (return)
df.replace('\r',' ', regex=True, inplace=True) 

# \n (line break)
df.replace('\n',' ', regex=True, inplace=True)   

# \t (tab)
df.replace('\t',' ', regex=True, inplace=True)   

# &amp; (&)
df.replace('&amp;',' ', regex=True, inplace=True)   

# &nbsp; (space)
df.replace('&nbsp;',' ', regex=True, inplace=True)  

# nbsp; (space, chained to other code)
df.replace('nbsp;',' ', regex=True, inplace=True)    

# # &gt; and &lt; (> and <)
df.replace('&lt;','', regex=True, inplace=True)   
df.replace('&gt;','', regex=True, inplace=True) 

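# &#x200B; (zero-width space shortcode); after the &amp; replacement above it may
# be left as '#x200B;'. regex=True is required for substring matching: without it,
# replace() only matches cells whose entire value equals the string.
df.replace(r'&?#x200B;', '', regex=True, inplace=True)

# ' (apostrophe)
df.replace("'", '', regex=True, inplace=True)

# * (asterisks, including ** pairs)
df.replace(r'\*', '', regex=True, inplace=True)

# 49ers (needs regex=True for the same substring-matching reason)
df.replace('49ers', 'fortyniners', regex=True, inplace=True)

# ‚Äì (mis-encoded en dash as it appears in the data)
df.replace('‚Äì', '', regex=True, inplace=True)

# [ (left bracket)
df.replace(r'\[', '', regex=True, inplace=True)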

I also tried looping through a list of symbols to make this more compact, but had trouble getting it to work, possibly because of how the escape sequences were written in the list.

In [92]:
# symbols = ['&amp;', '&nbsp;', 'nbsp;', '&lt;', '&gt;', '\*\*']

# for symbol in symbols:
#     df.replace(symbol, ' ', regex=True, inplace=True)
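
One way a looped version can work is sketched below, using `Series.str.replace` on a single text column with a list of `(pattern, replacement)` pairs. The `replacements` list, the `clean_text` helper, and the sample string are all hypothetical names and data I made up for illustration. Raw strings keep the backslashes intact for the regex engine:

```python
import pandas as pd

# Hypothetical (pattern, replacement) pairs, applied in order.
replacements = [
    (r'[\r\n\t]', ' '),             # line breaks and tabs
    (r'&amp;|&nbsp;|nbsp;', ' '),   # HTML shortcodes that become spaces
    (r'&lt;|&gt;', ''),             # angle-bracket shortcodes
    (r'&?#x200B;', ''),             # zero-width space shortcode
    (r'\*', ''),                    # asterisks
    (r"'", ''),                     # apostrophes
    (r'49ers', 'fortyniners'),      # keep the team name as one alphabetic token
]

def clean_text(series: pd.Series) -> pd.Series:
    """Apply every replacement to a text column, one pattern at a time."""
    for pattern, repl in replacements:
        series = series.str.replace(pattern, repl, regex=True)
    return series

sample = pd.Series(["Go 49ers!\n&amp; **bold** &#x200B;don't"])
cleaned = clean_text(sample)
print(cleaned[0])
```

Because the pairs are applied in order, the `&amp;` replacement runs before the zero-width space pattern, which is why that pattern makes the leading `&` optional.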

In [93]:
# df['selftext'][:20]

There is still cleaning to be done with regard to symbols and symbol code, but I'll circle back to this once I've done more row removal, as I might end up removing the rows that have the problematic text strings.

## Transform

In [94]:
df_m = df[['subreddit', 'target', 'score', 'comments', 'tag', 'alltext']]

### GetDummies

In [95]:
df_m_dums = pd.get_dummies(df_m['tag'], drop_first=True)

In [96]:
df_m = pd.concat([df_m, df_m_dums], axis=1)

In [97]:
df_m.drop(columns=['tag'], inplace=True)

In [98]:
df_m.head()

|   | subreddit | target | score | comments | alltext | :ava: Aston Villa | :brh: Brighton Hove Albion | :bur: Burnley | :che: Chelsea | :cry: Crystal Palace | ... | Question | Rumor | Tottenham Hotspur | Transfer News | Transfer Rumor | Watford FC | West Ham United | Who to Root for | Wolverhampton Wanderers | none |
|--:|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| 0 | epl | 1 | 1 | 164 | One day I hope Mourinho will go somewhere wher... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | epl | 1 | 1 | 3 | Whats the best place to watch Premier league a... | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | epl | 1 | 1 | 1 | Forget obsessing about the ESL... the REAL pro... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | epl | 1 | 1 | 16 | Tough day to be a Liverpool supporter. 22 shot... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | epl | 1 | 1 | 2 | Justice served. Fuck VAR I can’t believe that ... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |


### TFIDF

In [99]:
# df_m.columns

In [100]:
X = df_m.drop(columns=['subreddit', 'target'])
y = df_m['target']

In [101]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=74, stratify=y)

In [124]:
# X_train.head()
# X_test.head()

In [102]:
# add_stop_words = []
stpwds = text.ENGLISH_STOP_WORDS #.union(add_stop_words) # < uncomment to add custom stop words

In [103]:
# instantiate the transformer
tvec = TfidfVectorizer(stop_words=stpwds)

In [125]:
X_train_tvec = tvec.fit_transform(X_train['alltext'])
X_test_tvec = tvec.transform(X_test['alltext'])

In [126]:
type(X_train_tvec)

scipy.sparse.csr.csr_matrix

In [127]:
# convert training data to dataframe
X_train_df = pd.DataFrame(X_train_tvec.todense(), columns=tvec.get_feature_names())
X_train_df.head()

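|   | 00 | 000 | 00002 | 0009 | 000k | 000s | 000yd | 001 | 003 | 005 | ... | 좋은 | 중요하다 | 지고 | 하고 | 하는 | 하는데 | 해도 | 해오고 | 했는데 | 희생과 |
|--:|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |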


In [133]:
X_test_df = pd.DataFrame(X_test_tvec.todense(), columns=tvec.get_feature_names())
X_test_df.head()

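|   | 00 | 000 | 00002 | 0009 | 000k | 000s | 000yd | 001 | 003 | 005 | ... | 좋은 | 중요하다 | 지고 | 하고 | 하는 | 하는데 | 해도 | 해오고 | 했는데 | 희생과 |
|--:|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |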


In [134]:
X_train_df.reset_index(drop=True, inplace=True)
X_train.reset_index(drop=True, inplace=True)
X_test_df.reset_index(drop=True, inplace = True)
X_test.reset_index(drop=True, inplace=True)

In [135]:
X_train_all = pd.concat([X_train, X_train_df],axis = 1)
X_train_all.head()

|   | score | comments | alltext | :ava: Aston Villa | :brh: Brighton Hove Albion | :bur: Burnley | :che: Chelsea | :cry: Crystal Palace | :eve: Everton | :ful: Fulham | ... | 좋은 | 중요하다 | 지고 | 하고 | 하는 | 하는데 | 해도 | 해오고 | 했는데 | 희생과 |
|--:|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| 0 | 1 | 40 | If your club can choose pick for free any pl... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 105 | 238 | Conversely, every season it seems like one or ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 258 | 81 | Nick Foles could make Playoff History tomorrow... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 32 | 59 | Patriots had only two pro bowlers but all thei... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1 | 17 | Where will Sheffield United finish this year? ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [136]:
X_test_all = pd.concat([X_test, X_test_df],axis = 1)
X_test_all.head()

|   | score | comments | alltext | :ava: Aston Villa | :brh: Brighton Hove Albion | :bur: Burnley | :che: Chelsea | :cry: Crystal Palace | :eve: Everton | :ful: Fulham | ... | 좋은 | 중요하다 | 지고 | 하고 | 하는 | 하는데 | 해도 | 해오고 | 했는데 | 희생과 |
|--:|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
| 0 | 0 | 0 | MANCHESTER UNITED LACK OF SUMMER BUSINESS COST... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 1 | 1 | 32 | Rivalry help for an American Manchester United... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 1 | 6 | Which of the following young players have been... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 1 | 0 | Premier League or A-League? Which do you prefer? | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 4 | 1 | 34 | Have defenses just absolutely been getting tor... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
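
Converting the TF-IDF matrix with `.todense()` stores every zero explicitly, which gets expensive as the vocabulary grows. A possible alternative is to keep everything sparse and stack the numeric columns next to the TF-IDF features with `scipy.sparse.hstack`. This is a minimal sketch on three made-up posts, not the approach used above, and the column selection is an assumption:

```python
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for X_train (texts and numbers are illustrative only).
toy = pd.DataFrame({
    'score': [1, 105, 32],
    'comments': [40, 238, 59],
    'alltext': [
        'great pass by the quarterback',
        'the striker scored a late goal',
        'quarterback sacked late in the game',
    ],
})

# Fit TF-IDF on the text column; the result is already a sparse matrix.
tvec = TfidfVectorizer(stop_words='english')
text_features = tvec.fit_transform(toy['alltext'])

# Wrap the numeric columns as a sparse matrix so they can be stacked.
numeric_features = sparse.csr_matrix(toy[['score', 'comments']].to_numpy(dtype=float))

# Column-wise stack: numeric columns first, then one column per vocabulary term.
combined = sparse.hstack([numeric_features, text_features]).tocsr()

print(combined.shape)
```

Most scikit-learn estimators accept the sparse result directly, though the column names are lost and would need to be tracked separately.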


# Exports

## EDA Export

In [99]:
df.to_csv('../data/reddit_posts_clean_eda.csv', index=False)

## Model Export

In [100]:
df_m.to_csv('../data/reddit_posts_clean_modeling.csv', index=False)

In [101]:
# df[df['post'].str.len()>2000]

In [102]:
# df.head(20)

In [103]:
# df[df['subreddit'] == 'nfl'].sort_values(by='score', ascending=False).head(20)