## Part 2 - Data Cleaning

After extracting the data from Reddit, I will clean the datasets.

In [1]:
# Importing required libraries
import pandas as pd
import re

from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

In [2]:
# Import datasets which were extracted
marriage = pd.read_csv('../datasets/Marriage_Advice_all_posts.csv')
relationship = pd.read_csv('../datasets/Relationship_all_posts.csv')

### Cleaning marriage dataset

In [3]:
marriage

Unnamed: 0,selftext,title,created_utc,id,score
0,I'm a guy in his late 20s from India who now h...,Dichotomy regarding marriage in my mind,1648105452,tlzzki,1
1,My husband and I got married 3 years into dati...,How do you know when to leave a marriage?,1648100511,tlyske,1
2,TL;DR So my wife goes to a gym everyday which ...,Am I the asshole to not pick her up from the g...,1648097876,tly41g,1
3,Been married for about 9 years and have two ki...,when is it enough?,1648090786,tlw21d,1
4,tl;dr We just had a tornado warning and watch ...,How to comfort your wife ?,1648080885,tlnkeq,1
...,...,...,...,...,...
9693,,Work to Do in Marriage Part 1,1491655674,646ul7,1
9694,,Do you think having a lover can save your marr...,1491477391,63s9xo,1
9695,,inter caste love marriage specialist,1491475682,63s5uy,1
9696,We've been married for 3 years. I discovered h...,Moving on after infidelity if he's not sorry...,1490366653,619bim,1


In [4]:
# Fill NaN values with an empty string instead.
marriage = marriage.fillna('')

In [5]:
marriage

Unnamed: 0,selftext,title,created_utc,id,score
0,I'm a guy in his late 20s from India who now h...,Dichotomy regarding marriage in my mind,1648105452,tlzzki,1
1,My husband and I got married 3 years into dati...,How do you know when to leave a marriage?,1648100511,tlyske,1
2,TL;DR So my wife goes to a gym everyday which ...,Am I the asshole to not pick her up from the g...,1648097876,tly41g,1
3,Been married for about 9 years and have two ki...,when is it enough?,1648090786,tlw21d,1
4,tl;dr We just had a tornado warning and watch ...,How to comfort your wife ?,1648080885,tlnkeq,1
...,...,...,...,...,...
9693,,Work to Do in Marriage Part 1,1491655674,646ul7,1
9694,,Do you think having a lover can save your marr...,1491477391,63s9xo,1
9695,,inter caste love marriage specialist,1491475682,63s5uy,1
9696,We've been married for 3 years. I discovered h...,Moving on after infidelity if he's not sorry...,1490366653,619bim,1


In [6]:
# Drop the score and id columns as they will not be used
marriage.drop(columns=['score', 'id'], inplace=True)

Posts that were deleted or removed by moderators need to be dropped, since their body is only a placeholder. The recurring moderator posts that remind redditors of the subreddit's rules also need to be removed.

In [8]:
# Build a mask flagging removed posts, deleted posts, and moderator posts (titles containing 'Weekly advice')
to_remove = (
    (marriage['selftext'] == '[removed]') |
    (marriage['selftext'] == '[deleted]') |
    (marriage['title'].str.contains('Weekly advice'))
)

In [9]:
# Drop rows which were identified
marriage.drop(marriage[to_remove].index, inplace=True)
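
As a rough illustration of how this mask behaves, the sketch below applies the same three conditions to a small made-up DataFrame (the rows are invented for illustration and are not taken from the dataset). It keeps the complement with `~mask`, which gives the same result as dropping the flagged index labels when the index is unique.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'selftext': ['[removed]', 'Real post body', '[deleted]', 'Rules reminder', ''],
    'title': ['Help', 'When is it enough?', 'Advice', 'Weekly advice thread', 'Link post'],
})

# Same three conditions as the notebook's mask
mask = (
    (toy['selftext'] == '[removed]')
    | (toy['selftext'] == '[deleted]')
    | toy['title'].str.contains('Weekly advice')
)

# Keep only the rows that were not flagged
kept = toy[~mask]
print(kept['title'].tolist())
```

Note that a post with an empty body (for example a link-only post) is not flagged by these conditions and stays in the data.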

In [None]:
# Drop any other duplicated posts
marriage.drop_duplicates(inplace=True)

While cleaning the data, I noticed that some posts in this subreddit are written in other languages, which would hinder the analysis. I will therefore remove non-ASCII characters as well as some HTML markers.

In [10]:
marriage['title'] = marriage['title'].str.encode('ascii', 'ignore').str.decode('ascii')

In [11]:
marriage['selftext'] = marriage['selftext'].str.encode('ascii', 'ignore').str.decode('ascii')
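
To make the effect of the encode/decode round trip concrete, here is a minimal sketch on plain Python strings (the sample strings are invented for illustration). It shows that the `'ignore'` error handler silently drops every character outside ASCII.

```python
# Hypothetical sample strings, invented for illustration
samples = ['I\u2019ve been married', 'na\u00efve caf\u00e9', 'plain text']

# encode(..., 'ignore') drops anything outside ASCII;
# decode turns the remaining bytes back into a str
cleaned = [s.encode('ascii', 'ignore').decode('ascii') for s in samples]
print(cleaned)
```

One side effect worth keeping in mind is that typographic apostrophes and accented letters inside English text are dropped as well, so a word like "I’ve" becomes "Ive".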

In [12]:
# Function to strip some identified HTML markers.
# Note: these patterns match the literal text 'xa0' and 'x200b',
# not the characters '\xa0' or '\u200b'.
def remove_html_markers(text):
    text = re.sub('xa0', '', str(text)).strip()
    text = re.sub('x200b', '', str(text)).strip()
    return text

In [13]:
# Applying function to the data sets
marriage['selftext'] = marriage['selftext'].map(remove_html_markers)
marriage['title'] = marriage['title'].map(remove_html_markers)
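
The patterns in `remove_html_markers` match the literal letters `xa0` and `x200b`. If the goal were instead to remove the non-breaking space (`'\xa0'`) and zero-width space (`'\u200b'`) characters themselves, or their HTML entity forms, one possible approach is sketched below. `strip_invisible_markers` is a hypothetical name, and this is not the function that produced the outputs shown in this notebook.

```python
import html

def strip_invisible_markers(text):
    """Hypothetical alternative, not the notebook's remove_html_markers."""
    # Turn HTML entities such as '&#x200B;' or '&nbsp;' into real characters
    text = html.unescape(str(text))
    # Replace non-breaking spaces with ordinary spaces, drop zero-width spaces
    text = text.replace('\xa0', ' ').replace('\u200b', '')
    return text.strip()

# Sample string invented for illustration
sample = '&#x200B;Been married\xa0for 9 years\u200b '
print(repr(strip_invisible_markers(sample)))
```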

Creating extra features for further analysis

In [14]:
# Create a column from the combined text of title and selftext
marriage['all_text'] = marriage['title'] + " " + marriage['selftext']

In [15]:
# Create columns holding the character length of each post
marriage['text_length'] = marriage['selftext'].map(lambda x: len(x))
marriage['title_length'] = marriage['title'].map(lambda x: len(x))
marriage['all_text_length'] = marriage['all_text'].map(lambda x: len(x))
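
A quick sanity check on these length columns: because `all_text` joins the title and body with a single space, `all_text_length` should always equal `title_length + text_length + 1`, including for posts with an empty body. The sketch below verifies this on a small made-up frame (the body text is invented, and it uses `.str.len()` as an equivalent way to compute the lengths).

```python
import pandas as pd

# Toy rows, invented for illustration; the second one has an empty body
toy = pd.DataFrame({
    'title': ['when is it enough?', 'Work to Do in Marriage Part 1'],
    'selftext': ['Been married for about 9 years', ''],
})
toy['all_text'] = toy['title'] + ' ' + toy['selftext']

toy['text_length'] = toy['selftext'].str.len()
toy['title_length'] = toy['title'].str.len()
toy['all_text_length'] = toy['all_text'].str.len()

# The joining space adds exactly one character to every row
consistent = (
    toy['all_text_length'] == toy['title_length'] + toy['text_length'] + 1
).all()
print(consistent)
```

For posts without a body, this also means `all_text` ends with a trailing space.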

In [16]:
# Labelling marriage as the target post
marriage['is_marriage'] = 1

In [18]:
# Reset index 
marriage.reset_index(drop=True, inplace=True)

In [19]:
marriage

Unnamed: 0,selftext,title,created_utc,all_text,text_length,title_length,all_text_length,is_marriage
0,I'm a guy in his late 20s from India who now h...,Dichotomy regarding marriage in my mind,1648105452,Dichotomy regarding marriage in my mind I'm a ...,983,39,1023,1
1,My husband and I got married 3 years into dati...,How do you know when to leave a marriage?,1648100511,How do you know when to leave a marriage? My h...,3131,41,3173,1
2,TL;DR So my wife goes to a gym everyday which ...,Am I the asshole to not pick her up from the g...,1648097876,Am I the asshole to not pick her up from the g...,1283,115,1399,1
3,Been married for about 9 years and have two ki...,when is it enough?,1648090786,when is it enough? Been married for about 9 ye...,627,18,646,1
4,tl;dr We just had a tornado warning and watch ...,How to comfort your wife ?,1648080885,How to comfort your wife ? tl;dr We just had a...,376,26,403,1
...,...,...,...,...,...,...,...,...
9308,,Work to Do in Marriage Part 1,1491655674,Work to Do in Marriage Part 1,0,29,30,1
9309,,Do you think having a lover can save your marr...,1491477391,Do you think having a lover can save your marr...,0,51,52,1
9310,,inter caste love marriage specialist,1491475682,inter caste love marriage specialist,0,36,37,1
9311,We've been married for 3 years. I discovered h...,Moving on after infidelity if he's not sorry...,1490366653,Moving on after infidelity if he's not sorry.....,1064,47,1112,1


After cleaning the dataset, 9313 posts remain.

### Cleaning relationship dataset

In [20]:
relationship

Unnamed: 0,selftext,title,created_utc,id,score
0,We've been friends for just over 6 years now a...,I (m24) am in love with my best friend (f25) a...,1648110833,tm189y,1
1,Man I’ve been seeing for almost a year now blo...,Blocked and then unblocked ???,1648110802,tm180g,1
2,[removed],What should I do and is this normal?,1648110616,tm16iy,1
3,\nTLDR: Husband and His Sisters are insinuatin...,Catty In Laws,1648110431,tm152l,1
4,\n\nI didn't know that he had liked me. We us...,My (24F) best friend ended a friendship with m...,1648110402,tm14sr,1
...,...,...,...,...,...
9990,[removed],HELP,1647741881,tiapj2,1
9991,My partner and I have been together for a few ...,My partner (26m) decided to change after I (23...,1647741840,tiap1y,1
9992,I am 26 F still living at home and I’ve been b...,Feeling lost - confused mother-daughter relati...,1647741802,tiaooo,1
9993,Me ( 22 F) and my boyfriend (21 M) have been d...,I am not really sure what I am doing...,1647741781,tiaog5,1


In [21]:
# Fill NaN values with an empty string instead.
relationship = relationship.fillna('')

In [22]:
relationship

Unnamed: 0,selftext,title,created_utc,id,score
0,We've been friends for just over 6 years now a...,I (m24) am in love with my best friend (f25) a...,1648110833,tm189y,1
1,Man I’ve been seeing for almost a year now blo...,Blocked and then unblocked ???,1648110802,tm180g,1
2,[removed],What should I do and is this normal?,1648110616,tm16iy,1
3,\nTLDR: Husband and His Sisters are insinuatin...,Catty In Laws,1648110431,tm152l,1
4,\n\nI didn't know that he had liked me. We us...,My (24F) best friend ended a friendship with m...,1648110402,tm14sr,1
...,...,...,...,...,...
9990,[removed],HELP,1647741881,tiapj2,1
9991,My partner and I have been together for a few ...,My partner (26m) decided to change after I (23...,1647741840,tiap1y,1
9992,I am 26 F still living at home and I’ve been b...,Feeling lost - confused mother-daughter relati...,1647741802,tiaooo,1
9993,Me ( 22 F) and my boyfriend (21 M) have been d...,I am not really sure what I am doing...,1647741781,tiaog5,1


In [23]:
# Drop the score and id columns as they will not be used
relationship.drop(columns=['score', 'id'], inplace=True)

Posts that were deleted or removed by moderators need to be dropped. I also identified one spam post, which I will remove as well.

In [25]:
# Creating a mask to remove removed, deleted and spam posts.
to_remove_relationship = (
    (relationship['selftext'] == '[removed]') | 
    (relationship['selftext'] == '[deleted]') |
    (relationship['selftext'].str.contains('lower surface'))
)

In [26]:
# Drop rows which were identified
relationship.drop(relationship[to_remove_relationship].index, inplace=True)

In [None]:
# Drop any other duplicated posts
relationship.drop_duplicates(inplace=True)

In [27]:
# Apply the function that strips the identified HTML markers
relationship['selftext'] = relationship['selftext'].map(remove_html_markers)
relationship['title'] = relationship['title'].map(remove_html_markers)

Creating extra features for further analysis

In [28]:
# Create a column from the combined text of title and selftext
relationship['all_text'] = relationship['title'] + " " + relationship['selftext']

In [29]:
# Create columns holding the character length of each post
relationship['text_length'] = relationship['selftext'].map(lambda x: len(x))
relationship['title_length'] = relationship['title'].map(lambda x: len(x))
relationship['all_text_length'] = relationship['all_text'].map(lambda x: len(x))

In [32]:
# Labelling relationship as 0
relationship['is_marriage'] = 0

In [33]:
# Reset index 
relationship.reset_index(drop=True, inplace=True)

In [35]:
relationship

Unnamed: 0,selftext,title,created_utc,all_text,text_length,title_length,all_text_length,is_marriage
0,We've been friends for just over 6 years now a...,I (m24) am in love with my best friend (f25) a...,1648110833,I (m24) am in love with my best friend (f25) a...,1805,70,1876,0
1,Man I’ve been seeing for almost a year now blo...,Blocked and then unblocked ???,1648110802,Blocked and then unblocked ??? Man I’ve been s...,1072,30,1103,0
2,TLDR: Husband and His Sisters are insinuating ...,Catty In Laws,1648110431,Catty In Laws TLDR: Husband and His Sisters ar...,3149,13,3163,0
3,I didn't know that he had liked me. We used to...,My (24F) best friend ended a friendship with m...,1648110402,My (24F) best friend ended a friendship with m...,2072,113,2186,0
4,Let me start off by saying that me and my girl...,"Did I cheat in my LDR? I think so, and feel re...",1648110392,"Did I cheat in my LDR? I think so, and feel re...",3260,67,3328,0
...,...,...,...,...,...,...,...,...
8836,What’s your outlook on your bf liking other gi...,Bf liking other girl’s posts on Twitter?,1647741882,Bf liking other girl’s posts on Twitter? What’...,511,40,552,0
8837,My partner and I have been together for a few ...,My partner (26m) decided to change after I (23...,1647741840,My partner (26m) decided to change after I (23...,1469,119,1589,0
8838,I am 26 F still living at home and I’ve been b...,Feeling lost - confused mother-daughter relati...,1647741802,Feeling lost - confused mother-daughter relati...,1337,52,1390,0
8839,Me ( 22 F) and my boyfriend (21 M) have been d...,I am not really sure what I am doing...,1647741781,I am not really sure what I am doing... Me ( 2...,3143,39,3183,0


After cleaning the dataset, 8841 posts remain.

### Add lemmatize column

This section adds a column in which stop words are removed and the remaining words are lemmatized, so that different forms of the same word are reduced to a common form. This may help improve model accuracy, since words such as 'he', 'she', 'in', 'and', 'to' and 'a' do not help distinguish posts from the two subreddits.

In [36]:
# Build the stop-word set from scikit-learn's built-in English list
stop_words1 = set(CountVectorizer(stop_words='english').get_stop_words())

# Create the lemmatizer once rather than on every call
lemmatizer = WordNetLemmatizer()

def lemmatize(text):
    # Split on single spaces so each token can be processed separately
    words = text.split(' ')

    # Drop stop words (case-sensitive match), then lemmatize what remains
    remaining_words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words1]

    return " ".join(remaining_words)

In [37]:
# Creating a column for lemmatized words
marriage['lem_all_text'] = marriage['all_text'].apply(lemmatize)
relationship['lem_all_text'] = relationship['all_text'].apply(lemmatize)
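
The stop-word check in `lemmatize` compares each space-separated token exactly as written, so a capitalised or punctuated token (for example `My` or `so,`) does not match a lowercase stop word. If that turns out to matter for modelling, a case-insensitive filter could look roughly like the sketch below. The helper name `filter_stop_words` and the tiny stop-word set are hypothetical stand-ins, and the lemmatization step is left out so the example runs without the NLTK corpora.

```python
import string

# Tiny stand-in stop-word set, invented for illustration
# (the notebook uses scikit-learn's English list instead)
stop_words = {'i', 'my', 'do', 'so', 'and', 'in'}

def filter_stop_words(text):
    """Hypothetical helper: drop stop words, ignoring case and edge punctuation."""
    kept = []
    for word in text.split():
        # Normalise only for the lookup; keep the original token in the output
        key = word.lower().strip(string.punctuation)
        if key not in stop_words:
            kept.append(word)
    return ' '.join(kept)

print(filter_stop_words('Did I cheat in my LDR? I think so, and feel really guilty'))
```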

### Combining the datasets

After removing unwanted characters and posts, as well as creating various columns for further analysis, I will now combine both datasets so that they may be used for modelling.

In [39]:
# Combine the two cleaned datasets
combined = pd.concat(
    objs=[relationship, marriage],
    axis=0,
)

In [40]:
# Reset index
combined.reset_index(drop=True, inplace=True)
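
As a side note, the concatenate-then-reset pattern can also be written in one step with `ignore_index=True`. The sketch below uses two small made-up frames (not the real datasets) and checks that the row count and the `is_marriage` labels survive the concatenation.

```python
import pandas as pd

# Toy frames, invented for illustration only
rel = pd.DataFrame({'title': ['a', 'b', 'c'], 'is_marriage': 0})
mar = pd.DataFrame({'title': ['d', 'e'], 'is_marriage': 1})

# ignore_index=True renumbers the rows 0..n-1 in one step
both = pd.concat([rel, mar], axis=0, ignore_index=True)

# Row count should equal the sum of the parts
assert len(both) == len(rel) + len(mar)

# Label counts show the class balance of the combined data
print(both['is_marriage'].value_counts().to_dict())
```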

In [41]:
# Check to ensure that the datasets are concatenated properly
combined

Unnamed: 0,selftext,title,created_utc,all_text,text_length,title_length,all_text_length,is_marriage,lem_all_text
0,We've been friends for just over 6 years now a...,I (m24) am in love with my best friend (f25) a...,1648110833,I (m24) am in love with my best friend (f25) a...,1805,70,1876,0,I (m24) love best friend (f25) I absolutely ha...
1,Man I’ve been seeing for almost a year now blo...,Blocked and then unblocked ???,1648110802,Blocked and then unblocked ??? Man I’ve been s...,1072,30,1103,0,Blocked unblocked ??? Man I’ve seeing year blo...
2,TLDR: Husband and His Sisters are insinuating ...,Catty In Laws,1648110431,Catty In Laws TLDR: Husband and His Sisters ar...,3149,13,3163,0,Catty In Laws TLDR: Husband His Sisters insinu...
3,I didn't know that he had liked me. We used to...,My (24F) best friend ended a friendship with m...,1648110402,My (24F) best friend ended a friendship with m...,2072,113,2186,0,My (24F) best friend ended friendship discover...
4,Let me start off by saying that me and my girl...,"Did I cheat in my LDR? I think so, and feel re...",1648110392,"Did I cheat in my LDR? I think so, and feel re...",3260,67,3328,0,"Did I cheat LDR? I think so, feel really guilt..."
...,...,...,...,...,...,...,...,...,...
18149,,Work to Do in Marriage Part 1,1491655674,Work to Do in Marriage Part 1,0,29,30,1,Work Do Marriage Part 1
18150,,Do you think having a lover can save your marr...,1491477391,Do you think having a lover can save your marr...,0,51,52,1,Do think having lover save marriage?
18151,,inter caste love marriage specialist,1491475682,inter caste love marriage specialist,0,36,37,1,inter caste love marriage specialist
18152,We've been married for 3 years. I discovered h...,Moving on after infidelity if he's not sorry...,1490366653,Moving on after infidelity if he's not sorry.....,1064,47,1112,1,Moving infidelity he's sorry... We've married ...


Save the cleaned datasets as CSV

In [43]:
relationship.to_csv('../datasets/relationship_cleaned.csv', index=False)

In [44]:
marriage.to_csv('../datasets/marriage_cleaned.csv', index=False)

In [45]:
combined.to_csv('../datasets/combined.csv', index=False)
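
One thing to keep in mind when reloading these files: with default settings, `pd.read_csv` parses empty fields as NaN, so the empty strings created by `fillna('')` do not survive the round trip. The sketch below demonstrates this on a small made-up frame written to an in-memory buffer, and shows `keep_default_na=False` as one possible way to keep them as strings.

```python
import io
import pandas as pd

# Toy frame, invented for illustration; the first row has an empty body
toy = pd.DataFrame({
    'title': ['Link post', 'Text post'],
    'selftext': ['', 'Some body text'],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Default parsing turns the empty field back into NaN
buffer.seek(0)
default_read = pd.read_csv(buffer)

# keep_default_na=False keeps empty fields as empty strings
buffer.seek(0)
string_read = pd.read_csv(buffer, keep_default_na=False)

print(default_read['selftext'].isna().sum(), string_read['selftext'].isna().sum())
```

Alternatively, calling `fillna('')` again after loading achieves the same result.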

### Proceed to next notebook for EDA.