<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: Web API and NLP

--- 

### Contents:
Notebook 2
- [Part3](#Part3)
    - Importing Library
    - Load CSV files
    - Data Cleaning
        - Function for removing https, special characters, terms and digits in the title and selftext
        - Understanding features
            - Whiskey
            - Rum
        - Feature engineering
            - Removing features for Whiskey
            - Removing features for Rum
            - Combine both Whiskey and Rum dataframe


--- 
# Part 3
Cleaning data

---

## 3.1 Importing library

In [3]:
import pandas as pd
import datetime as dt
import numpy as np
import re

In [4]:
pd.set_option('display.max_columns', 4000)
pd.set_option('display.max_rows', 4000)

## 3.2 Load csv files

In [5]:
whiskey = pd.read_csv('../datasets/whiskey.csv')
rum = pd.read_csv('../datasets/rum.csv')

## 3.3 Data cleaning

In [6]:
#checking
whiskey.head(1)

Unnamed: 0,author,id,score,subreddit,title,selftext,num_comments,timestamp
0,metdthero,vhz4az,1,whiskey,Facts and Truth,,0,2022-06-22 15:11:17


In [7]:
#checking
rum.head(1)

Unnamed: 0,author,id,score,subreddit,title,selftext,num_comments,timestamp
0,TRFKTA,vi2mrr,1,rum,Out shopping for Rum and came across this. The...,,0,2022-06-22 19:05:31


### 3.3.1 Function for removing https, special characters, terms and digits in the title and selftext


In [93]:
#remove URLs, special terms, digits and non-word characters in selftext and title
#for whiskey and rum (special characters and URLs were found in later results)

URL_PATTERN = r'\w+:\/\/[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*'
SPECIAL_TERMS_PATTERN = '#x200B;|&lt;|&gt;|&amp;|_'

def regex_cleaning(row):
    # Apply the same four steps to both text columns
    for column in ['selftext', 'title']:
        text = row[column]
        # Remove links
        text = re.sub(pattern=URL_PATTERN, repl='', string=text, flags=re.M)
        # Remove special terms (zero-width space code, HTML entities, underscores)
        text = re.sub(pattern=SPECIAL_TERMS_PATTERN, repl='', string=text)
        # Remove all digits
        text = re.sub(pattern=r'\d+', repl='', string=text)
        # Replace each run of non-word characters with a single space
        text = re.sub(pattern=r'\W+', repl=' ', string=text)
        row[column] = text

    return row
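
To see what the four substitution steps do in sequence, here is a minimal sketch that applies the same patterns to a single string. The helper name `clean_text` and the sample string are assumptions made for illustration; they are not part of the notebook.

```python
import re

# Same patterns as used in regex_cleaning
URL = r'\w+:\/\/[\d\w-]+(\.[\d\w-]+)*(?:(?:\/[^\s/]*))*'
SPECIAL = '#x200B;|&lt;|&gt;|&amp;|_'

def clean_text(text):
    """Hypothetical single-string version of the cleaning steps."""
    text = re.sub(URL, '', text, flags=re.M)   # 1. drop links
    text = re.sub(SPECIAL, '', text)           # 2. drop special terms and underscores
    text = re.sub(r'\d+', '', text)            # 3. drop digits
    text = re.sub(r'\W+', ' ', text)           # 4. collapse non-word runs to one space
    return text

sample = "Review #32: Green Spot https://preview.redd.it/abc.jpg &amp;#x200B;"
cleaned = clean_text(sample)
print(repr(cleaned))  # 'Review Green Spot '
```

The trailing space remains because the last step substitutes a space for every run of non-word characters, including one at the end of the string; calling `.strip()` on the result would remove it.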

### 3.3.2 Understanding features

In [94]:
#checking for null values
whiskey.isnull().sum()

author             0
id                 0
score              0
subreddit          0
title              0
selftext        6324
num_comments       0
timestamp          0
dtype: int64

In [95]:
rum.isnull().sum()

author             0
id                 0
score              0
subreddit          0
title              0
selftext        4683
num_comments       0
timestamp          0
dtype: int64

In [96]:
def valuecounts(dataframe):
    # print the value counts of every column
    for column in dataframe:
        print(dataframe[column].value_counts())

def describe(dataframe):
    # print the summary statistics of every column
    for column in dataframe:
        print(dataframe[column].describe())
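
Printing the full value counts of every column can produce very long output for high-cardinality columns such as `id`. As a sketch only, a variant that keeps the `n` most frequent values per column could look like this. The name `valuecounts_top`, the parameter `n`, and the toy data are assumptions, not part of the notebook.

```python
import pandas as pd

def valuecounts_top(dataframe, n=5):
    """Hypothetical variant of valuecounts: keep only the n most frequent values per column."""
    top_values = {}
    for column in dataframe:
        top_values[column] = dataframe[column].value_counts().head(n)
        print(top_values[column])
    return top_values

# Toy data standing in for the real posts
toy = pd.DataFrame({
    'author': ['[deleted]', '[deleted]', 'anax44', 't8ke'],
    'score': [1, 1, 1, 0],
})
top = valuecounts_top(toy, n=1)
```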

#### 3.3.2.1 Whiskey

In [97]:
valuecounts(whiskey)

[deleted]           580
winetimes            38
petermal67           35
irish56_ak           31
Neversafeforlife     23
                   ... 
myballsaresweaty      1
WhiskyIsRisky         1
narambula25           1
doown_                1
Fishermichaels        1
Name: author, Length: 7289, dtype: int64
vhz4az    1
5wbxy1    1
5w6d2e    1
5w6k63    1
5w6kxb    1
         ..
dtpbtz    1
dtpots    1
dtpoyp    1
dtq79u    1
24qhyh    1
Name: id, Length: 9842, dtype: int64
1       4561
0        603
2        460
3        351
4        272
5        269
6        247
7        209
8        203
9        185
10       168
11       142
12       137
13       110
15        99
16        94
14        87
18        71
19        69
17        64
21        63
23        57
20        57
25        50
28        44
24        42
27        40
26        38
30        37
38        31
31        30
34        29
22        27
32        24
37        23
40        22
43        21
35        21
45        21
29        21
49    

In [98]:
whiskey['selftext'].value_counts(sort=True, ascending=False).head()

[deleted]    339
[removed]

In [99]:
whiskey['num_comments'].value_counts()

0      1811
1       761
2       618
4       596
3       559
5       551
6       481
7       421
8       386
9       376
10      334
11      274
12      241
13      196
14      194
15      185
16      174
18      147
17      135
20      107
19      103
21       90
22       87
23       83
26       70
24       56
25       55
27       54
28       44
31       41
30       40
33       36
34       35
29       32
32       31
36       29
37       28
38       27
35       19
42       18
41       16
43       15
40       15
39       13
50       11
46       11
49       11
45       10
47        9
57        9
44        8
59        8
53        8
52        7
54        6
60        6
66        6
61        6
65        6
55        6
81        5
48        5
63        5
75        5
62        5
67        5
79        4
58        4
68        4
96        4
70        4
51        4
56        3
77        3
78        3
104       3
84        3
91        3
72        3
76        3
105       2
71        2
114       2
64  

In [100]:
whiskey.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9842 entries, 0 to 9841
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   author        9842 non-null   object
 1   id            9842 non-null   object
 2   score         9842 non-null   int64 
 3   subreddit     9842 non-null   object
 4   title         9842 non-null   object
 5   selftext      3518 non-null   object
 6   num_comments  9842 non-null   int64 
 7   timestamp     9842 non-null   object
dtypes: int64(2), object(6)
memory usage: 615.2+ KB
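
`info()` reports `timestamp` with dtype `object`, meaning the values are stored as strings. This notebook keeps only `title` and `selftext` later on, but if the timestamps were needed, a minimal sketch of parsing them could look like the following. The toy frame is an assumption; the format string is inferred from the two timestamps shown in the `head(1)` outputs.

```python
import pandas as pd

# Toy frame; the two timestamps are copied from the head(1) outputs
toy = pd.DataFrame({'timestamp': ['2022-06-22 15:11:17', '2022-06-22 19:05:31']})

# Parse the strings into datetime values using an explicit format
toy['timestamp'] = pd.to_datetime(toy['timestamp'], format='%Y-%m-%d %H:%M:%S')

# Datetime columns expose components through the .dt accessor
print(toy['timestamp'].dt.hour.tolist())  # [15, 19]
```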


In [101]:
describe(whiskey)

count          9842
unique         7289
top       [deleted]
freq            580
Name: author, dtype: object
count       9842
unique      9842
top       vhz4az
freq           1
Name: id, dtype: object
count    9842.000000
mean       13.606482
std        48.390404
min         0.000000
25%         1.000000
50%         1.000000
75%        10.000000
max      1729.000000
Name: score, dtype: float64
count        9842
unique          1
top       whiskey
freq         9842
Name: subreddit, dtype: object
count                 9842
unique                9711
top       Japanese whiskey
freq                     5
Name: title, dtype: object
count          3518
unique         3068
top       [deleted]
freq            339
Name: selftext, dtype: object
count    9842.000000
mean        9.681772
std        14.314444
min         0.000000
25%         1.000000
50%         6.000000
75%        12.000000
max       269.000000
Name: num_comments, dtype: float64
count                    9842
unique                 

In [102]:
whiskey.shape

(9842, 8)

#### 3.3.2.2 Rum

In [103]:
valuecounts(rum)

[deleted]           244
thefatrumpirate     124
anax44               99
LIFOanAccountant     86
t8ke                 79
                   ... 
marcoporras           1
LLbeejay              1
josmoize              1
MattChap              1
shardmonkey           1
Name: author, Length: 4150, dtype: int64
vi2mrr    1
7soi5q    1
7shegz    1
7sg24v    1
7sfpf0    1
         ..
geo4me    1
ge9yuj    1
ge9tr5    1
ge93u9    1
24o3yr    1
Name: id, Length: 7849, dtype: int64
1      4131
2       306
3       288
6       243
5       240
7       229
4       220
8       191
0       159
9       152
10      137
11      125
12      118
13      108
14       80
16       76
17       73
15       70
18       67
19       60
20       53
21       45
23       45
24       41
26       40
22       40
28       32
29       27
25       25
32       25
27       25
30       24
34       22
31       21
35       20
41       18
37       14
40       13
43       13
39       11
36       11
42       11
48       10
44       1

In [104]:
rum['selftext'].value_counts(sort=True, ascending=False).head()

[removed]    348
[deleted]    138
Last night I had som

In [105]:
rum.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7849 entries, 0 to 7848
Data columns (total 8 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   author        7849 non-null   object
 1   id            7849 non-null   object
 2   score         7849 non-null   int64 
 3   subreddit     7849 non-null   object
 4   title         7849 non-null   object
 5   selftext      3166 non-null   object
 6   num_comments  7849 non-null   int64 
 7   timestamp     7849 non-null   object
dtypes: int64(2), object(6)
memory usage: 490.7+ KB


In [106]:
describe(rum)

count          7849
unique         4150
top       [deleted]
freq            244
Name: author, dtype: object
count       7849
unique      7849
top       vi2mrr
freq           1
Name: id, dtype: object
count    7849.000000
mean        6.953115
std        12.223031
min         0.000000
25%         1.000000
50%         1.000000
75%         8.000000
max       166.000000
Name: score, dtype: float64
count     7849
unique       1
top        rum
freq      7849
Name: subreddit, dtype: object
count                       7849
unique                      7696
top       Monthly Astor Pick Ups
freq                           4
Name: title, dtype: object
count          3166
unique         2677
top       [removed]
freq            348
Name: selftext, dtype: object
count    7849.000000
mean        8.670149
std        10.894061
min         0.000000
25%         1.000000
50%         6.000000
75%        12.000000
max       231.000000
Name: num_comments, dtype: float64
count                    7849
unique     

In [107]:
rum.shape

(7849, 8)

Feature key points:  
  
1. **author** is a nominal feature. The value `[deleted]` appears when the user has deleted the account: 580 posts for whiskey and 244 for rum.  
  
2. **id** is a nominal feature: a unique string identifying each post (every id occurs exactly once).  
  
3. **score** is a discrete feature.  
  
4. **title** is a nominal feature.  
  
5. **selftext** is a nominal feature. It contains the placeholder values `[removed]` and `[deleted]`, and it is null for most posts (6324 for whiskey and 4683 for rum).  
  
6. **num_comments** is the number of comments, a discrete feature.  
  
7. **timestamp** is a continuous feature, currently stored as strings (dtype `object`).
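
The placeholder counts quoted above were read off the `value_counts` output. As a sketch, they could also be counted per column directly. The toy frame and the name `PLACEHOLDERS` are assumptions for illustration.

```python
import pandas as pd

PLACEHOLDERS = ['[deleted]', '[removed]']

# Toy data standing in for the real posts
toy = pd.DataFrame({
    'author': ['[deleted]', 'anax44', 't8ke'],
    'selftext': ['[removed]', '[deleted]', None],
})

# isin() flags exact matches; summing the booleans counts them per column
placeholder_counts = toy.isin(PLACEHOLDERS).sum()
print(placeholder_counts)
```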

### 3.3.3 Feature engineering

#### 3.3.3.1 Removing features for Whiskey

In [108]:
whiskey1 = whiskey[['title', 'selftext']]
whiskey1.head()

Unnamed: 0,title,selftext
0,Facts and Truth,
1,Dads old bottles.,"My father passed away in 2006, he had stopped ..."
2,Key West Bourbon Whiskey. Was it a mistake,
3,I picked up High West Double Rye Barrel Select...,
4,Review #32: Green Spot,


In [109]:
#remove duplicates
#reassigning instead of using inplace=True avoids SettingWithCopyWarning on the column subset
whiskey1 = whiskey1.drop_duplicates(keep="first")


In [110]:
# replace the [removed] and [deleted] placeholders and all nulls with an empty string
whiskey1 = whiskey1.replace({"[removed]": np.nan, "[deleted]": np.nan})
whiskey1 = whiskey1.fillna("")

In [111]:
#clean the titles and selftext before combining
whiskey1 = whiskey1.apply(regex_cleaning, axis=1)

In [112]:
#labelling each row in whiskey; 1 
whiskey1['origin'] = 1

In [113]:
#combine selftext and title to form a new column alltext
whiskey1['alltext'] = whiskey1['title'] + " " + whiskey1['selftext']

In [114]:
whiskey1.shape

(9765, 4)

#### 3.3.3.2 Removing features for Rum

In [115]:
rum1 = rum[['title', 'selftext']]
rum1.head()

Unnamed: 0,title,selftext
0,Out shopping for Rum and came across this. The...,
1,Most expensive bottle of rum I've personally c...,&amp;#x200B;\n\nhttps://preview.redd.it/rbvf9o...
2,Appleton Estate 21,"Hey all,\n\nI picked up two bottles of Appleto..."
3,"any liquor stores in Perth, WA that carry Dead...",I have a relative travelling home soon who cou...
4,Foursquare Sovereignty UK release?,Anyone any idea when this is releasing? Cause ...


In [116]:
#remove duplicates
#reassigning instead of using inplace=True avoids SettingWithCopyWarning on the column subset
rum1 = rum1.drop_duplicates(keep="first")


In [117]:
# remove all null, removed or deleted in dataframe, replace nan with empty string
rum1 = rum1.replace({"[removed]": np.NaN})
rum1 = rum1.replace({"[deleted]": np.NaN})
rum1 = rum1.replace(np.nan, "", regex=True)

In [118]:
#clean the titles and selftext before combining
rum1 = rum1.apply(regex_cleaning, axis=1)

In [119]:
#labelling each row in rum; 2
rum1['origin'] = 2

In [120]:
#combine selftext and title to form a new column alltext
rum1['alltext'] = rum1['title'] + " " + rum1['selftext']

In [121]:
rum1.shape

(7762, 4)

#### 3.3.3.3 Combine both dataframes

In [122]:
#combine both dataframes together
whiskeyrum = pd.concat(objs=[whiskey1, rum1], axis=0)
#removal of duplicates found in selftext
#(empty selftext values also count as duplicates of each other, so only the first such row is kept)
whiskeyrum.drop_duplicates(subset=['selftext'], keep="first", inplace=True)
#reset index
whiskeyrum.reset_index(inplace=True, drop=True)
whiskeyrum.head(50)

Unnamed: 0,title,selftext,origin,alltext
0,Facts and Truth,,1,Facts and Truth
1,Dads old bottles,My father passed away in he had stopped drinki...,1,Dads old bottles My father passed away in he ...
2,Opinions on Singleton Luscious Nectar,I am thinking about buying this one its also o...,1,Opinions on Singleton Luscious Nectar I am th...
3,Classic Examples of Whiskey categories Suggest...,I ve been drinking whiskey for over decades bu...,1,Classic Examples of Whiskey categories Suggest...
4,Buffalo Trace good buy,Found some Buffalo Trace for ml here in Vegas ...,1,Buffalo Trace good buy Found some Buffalo Tra...
5,The madness continues,So I was at Total Wine picking up a bottle of ...,1,The madness continues So I was at Total Wine p...
6,Old Fashioned Cocktails Whiskey Sour,Processing video xiiurmcyb Classic Whiskey So...,1,Old Fashioned Cocktails Whiskey Sour Processi...
7,Ohio folks,Is it true that Ohio has some law agreement wi...,1,Ohio folks Is it true that Ohio has some law a...
8,Total Wine More Member Program,Not sure if this is the right place to ask thi...,1,Total Wine More Member Program Not sure if th...
9,Parting gift suggestions,Hey guys I m leaving work and moving on to som...,1,Parting gift suggestions Hey guys I m leaving...


In [123]:
whiskeyrum.shape

(5696, 4)
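
Because `drop_duplicates(subset=['selftext'])` compares only the `selftext` column, every post with an empty `selftext` counts as a duplicate of the first such post, even when the titles differ. The sketch below shows this effect on a toy frame and contrasts it with deduplicating on the combined text. The toy data and the `alltext` alternative are assumptions for illustration, not the notebook's method; which behaviour is wanted depends on whether title-only posts should stay in the dataset.

```python
import pandas as pd

# Toy frame: two different posts that both have an empty selftext
toy = pd.DataFrame({
    'title': ['Facts and Truth', 'Green Spot', 'Dads old bottles'],
    'selftext': ['', '', 'My father passed away'],
})
toy['alltext'] = toy['title'] + ' ' + toy['selftext']

# Comparing selftext only: the second empty-selftext post is dropped
by_selftext = toy.drop_duplicates(subset=['selftext'], keep='first')

# Assumed alternative: compare the combined text instead
by_alltext = toy.drop_duplicates(subset=['alltext'], keep='first')

print(len(by_selftext), len(by_alltext))  # 2 3
```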

In [124]:
#saving to csv
whiskey1.to_csv('../datasets/whiskey1.csv', index=False)
rum1.to_csv('../datasets/rum1.csv', index=False)
whiskeyrum.to_csv('../datasets/whiskeyrum.csv', index=False)
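
One thing to keep in mind when these files are loaded again: empty strings are written to CSV as empty fields, and `pd.read_csv` reads empty fields back as `NaN` by default. The sketch below, which uses a toy frame and an in-memory buffer as stand-ins for the real files, shows the round trip and how `keep_default_na=False` preserves the empty strings.

```python
import io
import pandas as pd

# Toy frame standing in for the saved data
toy = pd.DataFrame({
    'title': ['Facts and Truth', 'Dads old bottles'],
    'selftext': ['', 'My father passed away'],
    'origin': [1, 1],
})

# With no path, to_csv returns the CSV text instead of writing a file
csv_text = toy.to_csv(index=False)

default_read = pd.read_csv(io.StringIO(csv_text))
kept_read = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

print(default_read['selftext'].isna().sum())  # 1
print(kept_read['selftext'].isna().sum())     # 0
```

An alternative is to read with the defaults and call `fillna("")` on the text columns afterwards.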