<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">


# DSI-SG-42 Project 3: Web APIs & NLP
### Reddit Scams: Are We Vulnerable?
---

## 2. Data Cleaning

### 2.1 Importing Libraries

In [1]:
import pandas as pd
import numpy as np
import re

### 2.2 Import scraped dataset

In [2]:
df = pd.read_csv('../data/reddit_scraped_data.csv')
df.head()

Unnamed: 0,subreddit,type,text,id,url
0,RandomKindness,title,[Request] Today is My Birthday. I just want so...,1bkqx22,https://www.reddit.com/r/RandomKindness/commen...
1,RandomKindness,body,I don’t normally do things like this. Today I ...,1bkqx22,https://www.reddit.com/r/RandomKindness/commen...
2,RandomKindness,comment,"You're doing amazing, and I'm proud of you! So...",1bkqx22,https://www.reddit.com/r/RandomKindness/commen...
3,RandomKindness,reply,"This made me smile, thank you 💕",1bkqx22,https://www.reddit.com/r/RandomKindness/commen...
4,RandomKindness,reply,I love this ♥️😆,1bkqx22,https://www.reddit.com/r/RandomKindness/commen...


### 2.3 First look at data

In [3]:
df.describe()

Unnamed: 0,subreddit,type,text,id,url
count,66481,66481,66456,66481,66481
unique,2,4,31841,1931,1931
top,RandomKindness,comment,[removed],1bgi54f,https://www.reddit.com/r/Scams/comments/1bgi54...
freq,37183,34201,4856,1351,1351


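A quick look shows that 'text' has 66,456 non-null entries but only 31,841 unique values, and the most frequent value, [removed], appears 4,856 times. The scrape therefore produced many rows with duplicate text. We will remove these duplicates, keeping the first occurrence of each text.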
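As a minimal sketch of how de-duplicating on a single column behaves, the toy frame below (made-up rows, not taken from the scraped dataset) repeats both a text value and a missing value. It assumes the default `keep='first'` behaviour of `drop_duplicates`.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not taken from the scraped dataset
toy = pd.DataFrame({
    'subreddit': ['Scams', 'Scams', 'RandomKindness', 'RandomKindness', 'Scams'],
    'text': ['[removed]', '[removed]', 'Happy birthday!', np.nan, np.nan],
})

# keep='first' is the default: the first row of each repeated text survives
deduped = toy.drop_duplicates(subset=['text'])
print(deduped)
```

Missing values count as duplicates of each other, so exactly one null row survives this step. That is consistent with the single null found in section 2.5.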

### 2.4 Remove duplicate rows

In [4]:
# Drop rows whose 'text' duplicates an earlier row (the first occurrence is kept)
df1 = df.drop_duplicates(subset=['text'])
# Verify that every remaining 'text' value is unique
df1.describe()

Unnamed: 0,subreddit,type,text,id,url
count,31842,31842,31841,31842,31842
unique,2,4,31841,1889,1889
top,RandomKindness,reply,[Request] Today is My Birthday. I just want so...,1bgnrzz,https://www.reddit.com/r/Scams/comments/1bgnrz...
freq,16974,14174,1,469,469


### 2.5 Check for null values

In [5]:
df1.isnull().sum()

subreddit    0
type         0
text         1
id           0
url          0
dtype: int64

In [6]:
print(df1.shape)

(31842, 5)


In [7]:
# Remove the row with a null 'text' (.copy() avoids SettingWithCopyWarning on later column assignments)
df2 = df1.dropna(subset=['text']).copy()
# Verify that null value has been dropped
print(df2.shape)

(31841, 5)


In [8]:
df2.head(10)

Unnamed: 0,subreddit,type,text,id,url
0,RandomKindness,title,[Request] Today is My Birthday. I just want so...,1bkqx22,https://www.reddit.com/r/RandomKindness/commen...
1,RandomKindness,body,I don’t normally do things like this. Today I ...,1bkqx22,https://www.reddit.com/r/RandomKindness/commen...
2,RandomKindness,comment,"You're doing amazing, and I'm proud of you! So...",1bkqx22,https://www.reddit.com/r/RandomKindness/commen...
3,RandomKindness,reply,"This made me smile, thank you 💕",1bkqx22,https://www.reddit.com/r/RandomKindness/commen...
4,RandomKindness,reply,I love this ♥️😆,1bkqx22,https://www.reddit.com/r/RandomKindness/commen...
5,RandomKindness,reply,Where did you find the virtual bubble wrap? Th...,1bkqx22,https://www.reddit.com/r/RandomKindness/commen...
6,RandomKindness,reply,>!pop!!<>!pop!!<,1bkqx22,https://www.reddit.com/r/RandomKindness/commen...
7,RandomKindness,reply,Yay I’m happy it’s your birthday cause now we ...,1bkqx22,https://www.reddit.com/r/RandomKindness/commen...
8,RandomKindness,comment,Happy Birthday!! I'm glad you're still with us...,1bkqx22,https://www.reddit.com/r/RandomKindness/commen...
9,RandomKindness,comment,You do not sound like a loser at all. You soun...,1bkqx22,https://www.reddit.com/r/RandomKindness/commen...


### 2.6 Remove auto-generated texts

We noticed that many comments/replies were auto-generated by bots, identifiable by the phrase "I am a bot". These carry little meaning for our analysis, so we will remove these rows.
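A minimal sketch of this kind of substring filter, on made-up comments rather than the scraped data. It assumes a literal match on the phrase "I am a bot" and uses `regex=False` so the pattern is not interpreted as a regular expression.

```python
import pandas as pd

# Made-up comments for illustration only
toy = pd.DataFrame({'text': [
    'I am a bot, and this action was performed automatically.',
    'Thanks, this really helped!',
    'Saying I am a bot is not something a human would do.',
]})

# True where the literal phrase appears anywhere in the text
is_bot = toy['text'].str.contains('I am a bot', regex=False)

# Keep only the rows that do not contain the phrase
kept = toy[~is_bot]
print(kept)
```

Note that the third row is a human comment that merely quotes the phrase, and it is filtered out as well. A substring match is simple but may discard a small number of genuine comments.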

In [9]:
# Remove bot messages (rows whose text contains the phrase 'I am a bot')
df2 = df2[~df2['text'].str.contains('I am a bot', regex=False)].copy()
df2.shape

(27395, 5)

### 2.7 Remove special characters

A few other kinds of text will not add value to our analysis. They are:

1. Bracketed tags that describe the nature of a post, usually found in titles, for example [Request] or [Offer]
2. URLs within comments/replies
3. r/ references to a subreddit
4. Emojis and punctuation

We will use regex to replace the above with an empty string '', and then remove any rows whose cleaned text is left empty.
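To make the four steps concrete, here is a sketch that applies them one at a time to a single made-up title. The patterns `http\S+` and `\br/\w+` are assumed forms chosen for this illustration; the sample string and the `step1` to `step4` names are hypothetical.

```python
import re

# Made-up title for illustration only
sample = '[Request] Check r/Scams first: https://example.com/help 💕 Thanks!'

step1 = re.sub(r'\[.*?\]', '', sample)   # 1. drop bracketed tags such as [Request]
step2 = re.sub(r'http\S+', '', step1)    # 2. drop the whole URL up to the next whitespace
step3 = re.sub(r'\br/\w+', '', step2)    # 3. drop subreddit references such as r/Scams
step4 = re.sub(r'[^\w\s]', '', step3)    # 4. drop punctuation and emojis

# Lowercase and trim the ends; inner runs of spaces are left as they are
cleaned = step4.lower().strip()
print(cleaned.split())  # ['check', 'first', 'thanks']
```

Each removal leaves its surrounding spaces behind, so the cleaned string can contain runs of spaces between words. Tokenisers that split on whitespace are generally unaffected by this.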

In [10]:
# Function to remove square-bracket tags, URLs, r/ references, emojis and punctuation using regex
def clean_text(text):
    # Leave non-string values (e.g. NaN) untouched
    if not isinstance(text, str):
        return text
    # Remove square-bracket tags such as [Request]
    text = re.sub(r'\[.*?\]', '', text)
    # Remove URLs ('\S+' consumes the whole URL, not just one character after 'http')
    text = re.sub(r'http\S+', '', text)
    # Remove subreddit references such as r/Scams
    text = re.sub(r'\br/\w+', '', text)
    # Remove emojis and punctuation (anything that is not a word character or whitespace)
    text = re.sub(r'[^\w\s]', '', text)
    # Lowercase and trim surrounding whitespace
    return text.lower().strip()

# Apply the function to the 'text' column
df2['cleaned_text'] = df2['text'].apply(clean_text)

In [11]:
# Drop rows whose 'cleaned_text' is an empty string
df2 = df2[df2['cleaned_text'] != ''].copy()

### 2.8 Assign Binary Labels for r/RandomKindness and r/Scams

In [12]:
# Label 0 for RandomKindness, 1 for Scams, stored as an integer column
df2['label'] = (df2['subreddit'] == 'Scams').astype(int)
# Display the label column to verify
df2['label']

0        0
1        0
2        0
3        0
4        0
        ..
66468    1
66471    1
66472    1
66474    1
66475    1
Name: label, Length: 27335, dtype: int32

In [13]:
df2.shape

(27335, 7)

### 2.9 Export data to CSV to proceed with EDA

In [14]:
df2.to_csv('../data/cleaned_data.csv', index=False)
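One thing that may be worth checking when the exported file is read back for EDA: by default, `pd.read_csv` interprets strings such as 'null' and 'nan' as missing values, so a cleaned comment consisting of just one of those words would come back as NaN. The sketch below demonstrates this on made-up rows using an in-memory buffer instead of the real file; whether the actual dataset contains such rows is an assumption that has not been verified here.

```python
import io
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'cleaned_text': ['thank you', 'null', 'nan'],
    'label': [0, 1, 1],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing: 'null' and 'nan' are read back as missing values
buffer.seek(0)
default_read = pd.read_csv(buffer)
print(default_read['cleaned_text'].isna().sum())

# keep_default_na=False keeps those strings as ordinary text
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)
print(literal_read['cleaned_text'].isna().sum())
```

If such rows matter for the analysis, reading with `keep_default_na=False` or dropping nulls again after loading are two possible ways to handle them.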