# Exploratory data analysis on YouTube search and watch history

Veera Määttänen 5.8.2023

The idea behind this project is to learn more about data analysis with Python libraries such as numpy, pandas, matplotlib and seaborn.

- What do I want to obtain from the data? 
- What's the problem or situation that I am trying to solve or understand?
- Other questions?

#### 1. From HTML to CSV files

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

import numpy as np

from datetime import datetime

In [19]:
with open("watch_history_1.html", encoding='utf8') as fp:
    soup = BeautifulSoup(fp, 'html.parser')

In [23]:
articles = soup.find_all('div', class_='content-cell mdl-cell mdl-cell--6-col mdl-typography--body-1')

I had to do some error handling here because the contents of the article are different for hidden and deleted videos.

- First I tried to use try/except, but it didn't work the way I wanted.
- I noticed that the length of the contents was 6 for all the videos that are still up, and decided to use that as a condition.
- There were two types of missing data: deleted videos and hidden/private videos. Deleted videos didn't have any information about the channel or the video, but I wanted to keep them because they still have the date, which I am also interested in.
- Some of the hidden/private videos had working links and I was able to watch the videos, but I decided it would be simpler, and good enough for this analysis, to just label these as hidden or private.

In [24]:
watch_history = []
for article in articles:
    # The watch date is the last node of the cell in all three cases.
    date = article.contents[-1]
    if len(article.contents) == 6:  # Videos that are still up
        title = article.find('a').string
        channel = title.find_next('a').string
    elif len(article.contents) == 3:  # Deleted videos: no channel information
        title = article.find_next(string=True)
        channel = float('nan')
    else:  # Hidden or private videos
        title = 'Private/hidden video'
        channel = float('nan')
    watch_history.append([title, channel, date])

In [13]:
df = pd.DataFrame(watch_history, columns=['title','channel','date'])
df.to_csv('watch_history.csv')

NameError: name 'watch_history' is not defined

I'm going to load the watch_history.csv file into another dataframe so I won't have to rerun the BeautifulSoup parsing each time (it takes 30 min lol).
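The same idea can be wrapped in a small helper that only parses when no CSV exists yet. This is a minimal sketch, not what the notebook actually runs: `load_or_parse` and `parse_rows` are hypothetical names, and `parse_rows` stands in for the BeautifulSoup loop above. Unlike the cell above, it writes the CSV with `index=False`, so no extra index column shows up when reading it back.

```python
from pathlib import Path

import pandas as pd


def load_or_parse(csv_path, parse_rows):
    """Return the watch history, parsing the HTML only if no CSV cache exists.

    `parse_rows` is a hypothetical zero-argument callable that returns a list
    of [title, channel, date] rows (for example, the BeautifulSoup loop).
    """
    csv_path = Path(csv_path)
    if csv_path.exists():
        # Fast path: reuse the result of an earlier run.
        return pd.read_csv(csv_path)
    # Slow path: parse once, then cache the rows for later sessions.
    df = pd.DataFrame(parse_rows(), columns=['title', 'channel', 'date'])
    df.to_csv(csv_path, index=False)  # assumption: the index is not needed
    return df
```

With this, the slow parsing step only runs the first time; later calls go straight to `pd.read_csv`.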

In [3]:
df0 = pd.read_csv('watch_history.csv')

#### 2. First look at the data

In [4]:

df0.head()

Unnamed: 0,index,title,channel,date
0,0,6. tammikuuta 2022,Jyrki Hakkarainen,1.6.2023 klo 22.10.05 EEST
1,1,DATA ANALYST PORTFOLIO | 10 PROJECT IDEAS,Data With Mo,1.6.2023 klo 21.46.17 EEST
2,2,Q&A with a person who does not have an interna...,PA Struggles,1.6.2023 klo 18.49.48 EEST
3,3,Four Common MISTAKES To AVOID If You Want To L...,Doctor Youn,1.6.2023 klo 18.49.44 EEST
4,4,How Rainbolt Identifies Countries #geoguessr #...,Profoundly Pointless,1.6.2023 klo 17.23.28 EEST


In [5]:
df0.shape

(39441, 4)

In [6]:
df0.describe()

Unnamed: 0,index
count,39441.0
mean,19720.0
std,11385.78032
min,0.0
25%,9860.0
50%,19720.0
75%,29580.0
max,39440.0


My most watched channel is PewDiePie: I have watched 1596 of his videos. This doesn't surprise me because I have watched him for a long time and absolutely love his content.
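A count like this can be obtained with `value_counts` on the `channel` column. Here is a minimal sketch on a small made-up dataframe; the rows are illustrative, not my real data:

```python
import pandas as pd

# Toy stand-in for df0; the titles and rows here are invented.
toy = pd.DataFrame({
    'title': ['v1', 'v2', 'v3', 'v4'],
    'channel': ['PewDiePie', 'Data With Mo', 'PewDiePie', None],
})

# value_counts() sorts by count (descending) and skips missing channels,
# so deleted and hidden/private videos are not counted.
channel_counts = toy['channel'].value_counts()
top_channel = channel_counts.index[0]
top_count = channel_counts.iloc[0]
print(top_channel, top_count)
```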

In [7]:
df0.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 39441 entries, 0 to 39440
Data columns (total 4 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   index    39441 non-null  int64 
 1   title    39441 non-null  object
 2   channel  34126 non-null  object
 3   date     39441 non-null  object
dtypes: int64(1), object(3)
memory usage: 1.2+ MB


Next I want to split the date into year, month and time of day.

In [8]:
df0['date'][0]

'1.6.2023 klo 22.10.05 EEST'

In [9]:
# format pattern: %d.%m.%Y klo %H.%M.%S EEST

# Named date_format so it does not shadow the built-in format()
date_format = "%d.%m.%Y klo %H.%M.%S EEST"
test_date = df0['date'][0]
test = datetime.strptime(test_date, date_format)
print(test_date)
print(test)


1.6.2023 klo 22.10.05 EEST
2023-06-01 22:10:05


In [22]:
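# Parse the Finnish-formatted strings into datetime values
df0['date'] = pd.to_datetime(df0['date'], format='%d.%m.%Y klo %H.%M.%S EEST')
# Note: strftime turns the column back into plain strings (object dtype),
# so pd.to_datetime is needed again before using the .dt accessor
df0['date'] = df0['date'].dt.strftime('%Y-%m-%d %H:%M:%S')
df0.head()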

Unnamed: 0,index,title,channel,date
0,0,6. tammikuuta 2022,Jyrki Hakkarainen,2023-06-01 22:10:05
1,1,DATA ANALYST PORTFOLIO | 10 PROJECT IDEAS,Data With Mo,2023-06-01 21:46:17
2,2,Q&A with a person who does not have an interna...,PA Struggles,2023-06-01 18:49:48
3,3,Four Common MISTAKES To AVOID If You Want To L...,Doctor Youn,2023-06-01 18:49:44
4,4,How Rainbolt Identifies Countries #geoguessr #...,Profoundly Pointless,2023-06-01 17:23:28
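The format string above hard-codes the `EEST` suffix. If some rows of the export end in `EET` (winter time) instead, which I have not verified for this file, parsing them with that format would raise an error. A minimal sketch of one workaround is to strip the trailing timezone abbreviation before parsing; the sample strings below are made up:

```python
import pandas as pd

sample = pd.Series([
    '1.6.2023 klo 22.10.05 EEST',
    '15.1.2023 klo 19.05.30 EET',  # hypothetical winter-time row
])

# Drop the trailing timezone abbreviation so one format fits both cases.
# The result is a naive local wall-clock time without timezone information.
stripped = sample.str.replace(r'\s+EES?T$', '', regex=True)
parsed = pd.to_datetime(stripped, format='%d.%m.%Y klo %H.%M.%S')
print(parsed)
```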


I think it would be interesting to see when I watch the most and the fewest videos.

I want to add more columns based on the date, such as month, year, day and hour.
This will make the data easier to analyze and let me compare different time periods.
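
A minimal sketch of how such columns could be derived with the `.dt` accessor, assuming the `date` column has a datetime dtype (a column of strings would need `pd.to_datetime` first). The toy dataframe and the new column names are illustrative choices:

```python
import pandas as pd

# Toy stand-in for df0 with a proper datetime column.
toy = pd.DataFrame({
    'date': pd.to_datetime(['2023-06-01 22:10:05', '2022-12-24 08:30:00']),
})

# Each .dt attribute returns one integer per row.
toy['year'] = toy['date'].dt.year
toy['month'] = toy['date'].dt.month
toy['day'] = toy['date'].dt.day
toy['hour'] = toy['date'].dt.hour
print(toy)
```

With columns like these, a `groupby` on `hour` or `month` followed by `size()` would give the number of videos watched per hour or per month.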

Let's add more columns to the dataframe.