# YouTube Data Processor
This script processes YouTube comment data for exploratory data analysis and natural language processing.
The comments come from the top video results returned when "body image" is searched on YouTube.
The data was scraped with [scraping_youtube_comments.ipynb](https://github.com/seungguini/body_image_conversations/blob/main/scraping_youtube_comments.ipynb),
which is built on [Selenium](https://www.selenium.dev/).

## References
The processing methods below follow [NLP Part 2| Pre-Processing Text Data Using Python](https://towardsdatascience.com/preprocessing-text-data-using-python-576206753c28) by [Kamil Mysiak](https://medium.com/@kamilmysiak?source=post_page-----576206753c28--------------------------------).

## Procedure
1. Importing Libraries along with our Data
2. Expanding Contractions
3. Language Detection
4. Tokenization
5. Converting all Characters to Lowercase
6. Removing Punctuation
7. Removing Stopwords
8. Parts of Speech Tagging
9. Lemmatization
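
As a rough illustration of what steps 4 to 7 do to a single comment, here is a minimal sketch that uses only the standard library. It is a stand-in, not the notebook's implementation: the `preprocess` helper and the tiny `STOPWORDS` set are hypothetical, whitespace splitting replaces `word_tokenize`, and the notebook itself imports NLTK's stopword list.

```python
import string

# Tiny illustrative stopword set (an assumption for this sketch);
# the notebook imports NLTK's full English stopword list instead.
STOPWORDS = {"i", "am", "but", "did", "how", "these", "are", "the", "a", "an"}


def preprocess(comment):
    """Hypothetical helper: tokenize, lowercase, strip punctuation, drop stopwords."""
    # Step 4: tokenization (whitespace split as a stand-in for word_tokenize)
    tokens = comment.split()
    # Step 5: convert every token to lowercase
    tokens = [t.lower() for t in tokens]
    # Step 6: strip leading/trailing punctuation, then drop tokens left empty
    tokens = [t.strip(string.punctuation) for t in tokens]
    tokens = [t for t in tokens if t]
    # Step 7: remove stopwords
    return [t for t in tokens if t not in STOPWORDS]


print(preprocess("I am sorry but did anyone else notice how beautiful these girls are???"))
```

The remaining steps (contraction expansion, language detection, part-of-speech tagging, and lemmatization) depend on external libraries and are not covered by this sketch.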

### 1. Importing Libraries & Data

In [49]:
import pandas as pd
import numpy as np
import nltk
import string
# import fasttext
import contractions
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer

import matplotlib.pyplot as plt
plt.xticks(rotation=70)
pd.options.mode.chained_assignment = None
pd.set_option('display.max_colwidth',100)
%matplotlib inline

ModuleNotFoundError: No module named 'contractions'
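
The `ModuleNotFoundError` above means the `contractions` package was not installed in the environment when this cell ran, so the cell stopped at that import. A minimal sketch for checking which packages are missing before running the notebook; the helper name `missing_packages` is hypothetical, and the package list simply mirrors the third-party imports above:

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


# Third-party packages imported by the cell above
required = ["pandas", "numpy", "nltk", "contractions", "matplotlib"]
print(missing_packages(required))
```

Anything it lists can usually be installed with pip (for example `pip install contractions`), after which the import cell should be re-run.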

In [31]:
# pd.read_csv opens and closes the file itself, so no explicit close() is needed
df = pd.read_csv('./data/youtube_comments_bodyimage.csv', encoding='utf-8')

We can preview the data with `df.head()`.

In [32]:
df.head(3)

Unnamed: 0,url,link_title,channel,no_of_views,time_uploaded,comment,author,comment_posted,no_of_replies,upvotes,downvotes
0,https://www.youtube.com/watch?v=5mP5RveA_tk,Girls Ages 6-18 Talk About Body Image | Allure,Allure,"5,014,874 views","Jun 1, 2018",skinny - “eat more”\nbig - “eat less” \nblack - “ugly”\nwhite - “basic”,hauntedxdreamss,,,9.6K,
1,https://www.youtube.com/watch?v=5mP5RveA_tk,Girls Ages 6-18 Talk About Body Image | Allure,Allure,"5,014,874 views","Jun 1, 2018",The fact that the 11 year old doesn’t like looking in the mirror at all proves something wrong,Madeleine burns,,,8.5K,
2,https://www.youtube.com/watch?v=5mP5RveA_tk,Girls Ages 6-18 Talk About Body Image | Allure,Allure,"5,014,874 views","Jun 1, 2018",I am sorry but did anyone else notice how beautiful these girls are???,Marlie Noto,,,7.5K,


First, let's drop the unnecessary columns. The `comment_posted`, `no_of_replies`, and `downvotes` columns are empty in the preview, which suggests the scraper failed to capture them,
so we'll drop those. Since we're also not interested in who wrote the comments, we'll drop the `author` column too.

In [33]:
df.drop(['author','comment_posted','no_of_replies','downvotes'],axis=1,inplace=True)

In [36]:
df.head()

Unnamed: 0,url,link_title,channel,no_of_views,time_uploaded,comment,upvotes
0,https://www.youtube.com/watch?v=5mP5RveA_tk,Girls Ages 6-18 Talk About Body Image | Allure,Allure,"5,014,874 views","Jun 1, 2018",skinny - “eat more”\nbig - “eat less” \nblack - “ugly”\nwhite - “basic”,9.6K
1,https://www.youtube.com/watch?v=5mP5RveA_tk,Girls Ages 6-18 Talk About Body Image | Allure,Allure,"5,014,874 views","Jun 1, 2018",The fact that the 11 year old doesn’t like looking in the mirror at all proves something wrong,8.5K
2,https://www.youtube.com/watch?v=5mP5RveA_tk,Girls Ages 6-18 Talk About Body Image | Allure,Allure,"5,014,874 views","Jun 1, 2018",I am sorry but did anyone else notice how beautiful these girls are???,7.5K
3,https://www.youtube.com/watch?v=5mP5RveA_tk,Girls Ages 6-18 Talk About Body Image | Allure,Allure,"5,014,874 views","Jun 1, 2018","""It only takes a second to call a girl fat and She'll take a lifetime trying to starve herself.....",3.9K
4,https://www.youtube.com/watch?v=5mP5RveA_tk,Girls Ages 6-18 Talk About Body Image | Allure,Allure,"5,014,874 views","Jun 1, 2018","""I avoid looking in the mirror because if I do, I only think about how I want to be.""\n\n- 11 ye...",3.9K


Now we check whether any columns have missing values.

In [38]:
for col in df.columns:
    print(col, df[col].isnull().sum())

url 0
link_title 0
channel 0
no_of_views 0
time_uploaded 0
comment 0
upvotes 81


Every column is fully populated except `upvotes`, which is missing for 81 comments. This appears to happen when a comment has no likes,
so we'll replace the missing values with 0.

In [41]:
df.fillna(value=0, inplace=True)

In [43]:
# Check if there are any null values left
for col in df.columns:
    print(col, df[col].isnull().sum())

url 0
link_title 0
channel 0
no_of_views 0
time_uploaded 0
comment 0
upvotes 0


For the first section of the project, we'll simply conduct sentiment analysis on the comments and compare each comment's sentiment with its number of upvotes.

In [47]:
# Pull out comments and upvotes = [c]omments [u]pvotes [d]ata [f]rame
cudf = df.loc[:,['comment','upvotes']]
cudf.head()

Unnamed: 0,comment,upvotes
0,skinny - “eat more”\nbig - “eat less” \nblack - “ugly”\nwhite - “basic”,9.6K
1,The fact that the 11 year old doesn’t like looking in the mirror at all proves something wrong,8.5K
2,I am sorry but did anyone else notice how beautiful these girls are???,7.5K
3,"""It only takes a second to call a girl fat and She'll take a lifetime trying to starve herself.....",3.9K
4,"""I avoid looking in the mirror because if I do, I only think about how I want to be.""\n\n- 11 ye...",3.9K
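
The `upvotes` values are stored as abbreviated strings such as `9.6K`, mixed with the integer 0 used to fill the missing entries, so they need to be converted to numbers before they can be compared with sentiment scores. A minimal sketch of one way to do that; the `parse_upvotes` helper is hypothetical, and the `K`/`M` suffixes are an assumption about how YouTube abbreviates counts:

```python
def parse_upvotes(value):
    """Hypothetical helper: convert a count such as '9.6K' or 0 to an integer."""
    text = str(value).strip().upper().replace(",", "")
    multipliers = {"K": 1_000, "M": 1_000_000}  # assumed abbreviation suffixes
    if text and text[-1] in multipliers:
        # Round to absorb floating-point error, e.g. 9.6 * 1000
        return int(round(float(text[:-1]) * multipliers[text[-1]]))
    return int(float(text))


# Sample values taken from the preview above, plus a filled-in 0
raw_upvotes = ["9.6K", "8.5K", "7.5K", "3.9K", 0]
print([parse_upvotes(v) for v in raw_upvotes])
```

On the DataFrame, the same function could be applied with `cudf['upvotes'].apply(parse_upvotes)`.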


### Processing Comments - Contractions