# YouTube Watching Patterns Analysis (2023 vs. 2024)

## Introduction

This project analyzes my personal YouTube watch history to compare viewing patterns between 2023 and 2024.
The analysis looks for insights in three areas:

- **Temporal Trends**: Differences in viewing activity by time of day, day of the week, and month between the two years.
- **Content Preferences**: Shifts in the types of videos watched, including genres and channels.
- **Engagement Patterns**: Changes in binge-watching behavior and diversity of content consumed.
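
The three comparisons listed above could be computed roughly as follows once the history is in a pandas DataFrame. This is a minimal sketch, not the analysis itself: the rows are made-up sample data, the column names `Timestamp` and `Channel Name` follow the ones used later in this notebook, and both the helper name `summarize_year` and the 30-minute gap used to separate viewing sessions are illustrative assumptions.

```python
import pandas as pd

# Hypothetical sample rows; real data would come from the Takeout export.
sample = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2023-03-01 20:00", "2023-03-01 20:10", "2023-03-01 20:25",
        "2023-03-05 09:00",
        "2024-03-02 22:00", "2024-03-02 22:05",
    ]),
    "Channel Name": ["A", "A", "B", "C", "A", "D"],
})

SESSION_GAP = pd.Timedelta(minutes=30)  # assumed threshold for a new session


def summarize_year(df: pd.DataFrame) -> pd.DataFrame:
    """Per-year video count, unique channels, and viewing-session count."""
    df = df.sort_values("Timestamp").copy()
    df["Year"] = df["Timestamp"].dt.year
    # A new session starts at the first row, or when the gap to the
    # previous video exceeds the assumed threshold.
    gap = df["Timestamp"].diff()
    df["NewSession"] = gap.isna() | (gap > SESSION_GAP)
    summary = df.groupby("Year").agg(
        videos=("Timestamp", "size"),
        unique_channels=("Channel Name", "nunique"),
        sessions=("NewSession", "sum"),
    )
    # Engagement proxy: average number of videos watched per session
    summary["videos_per_session"] = summary["videos"] / summary["sessions"]
    return summary


summary = summarize_year(sample)

# Temporal trend: share of each year's views that fall in each hour of the day
hourly_share = (
    sample.assign(Year=sample["Timestamp"].dt.year,
                  Hour=sample["Timestamp"].dt.hour)
    .groupby("Year")["Hour"]
    .value_counts(normalize=True)
    .unstack(fill_value=0)
)

print(summary)
print(hourly_share)
```

Comparing shares rather than raw counts keeps the two years comparable even though the 2023 data covers fewer months.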

### **Dataset**
- **Source**: Google Takeout YouTube watch history export.
- **Time Range**: February 14, 2023, to the present.
- **Fields Extracted**:
  - Timestamps: Date and time of each video watched.
  - Titles and URLs: Metadata for video classification.
  - Channel Names: Creators of the videos.
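
Since URLs are among the extracted fields, one way to get a stable key for classification or de-duplication is to pull the video ID out of each URL. The sketch below assumes the standard `watch?v=...` URL form; the function name `extract_video_id` and the sample URL are hypothetical, and other forms such as short links are not handled.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: Optional[str]) -> Optional[str]:
    """Return the `v` query parameter of a YouTube watch URL, or None."""
    if not url:
        return None
    query = parse_qs(urlparse(url).query)
    values = query.get("v")
    return values[0] if values else None


# Hypothetical URL for illustration only
print(extract_video_id("https://www.youtube.com/watch?v=abc123XYZ_-&t=42s"))
```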

### **Research Question**
How have my YouTube watching patterns evolved between 2023 and 2024 in terms of temporal habits, content preferences, and engagement trends?

---

## Setup

Below are the necessary imports and setup steps for the project.



In [12]:
# Import required libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup
from wordcloud import WordCloud  # third-party package; install with: pip install wordcloud
import re

# Set display options for clarity
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 50)
sns.set_style("darkgrid")  # Clean plotting style; the 'seaborn-darkgrid' style name was deprecated in Matplotlib 3.6

# Create folder structure
os.makedirs("data/raw", exist_ok=True)
os.makedirs("data/processed", exist_ok=True)
os.makedirs("notebook", exist_ok=True)
os.makedirs("outputs", exist_ok=True)
os.makedirs("visualizations", exist_ok=True)

# Define directories for reference
DATA_DIR_RAW = "data/raw/"
DATA_DIR_PROCESSED = "data/processed/"
VISUAL_DIR = "visualizations/"
OUTPUT_DIR = "outputs/"

print("Project setup complete. Directory structure created.")


ModuleNotFoundError: No module named 'wordcloud'

In [3]:
import pandas as pd
from bs4 import BeautifulSoup
import re

# Step 1: Load the HTML file
file_path = "data/izleme_gecmisi.html"

with open(file_path, "r", encoding="utf-8") as file:
    soup = BeautifulSoup(file, "html.parser")

# Step 2: Extract data from the HTML
data = []
for entry in soup.find_all("div", class_="content-cell"):
    # Separate text nodes with newlines so each field sits on its own line
    text = entry.get_text(separator="\n")
    link = entry.find("a")  # first link in the cell, looked up once
    url = link.get("href") if link else None

    # Use regex to extract date and time from the text
    match = re.search(r"(\d{1,2} \w+ \d{4}, \d{1,2}:\d{2} [AP]M)", text)
    if match:
        timestamp = pd.to_datetime(match.group(1), format="%d %B %Y, %I:%M %p")
        title = link.get_text(strip=True) if link else "Unknown Title"
        # Extract the channel name: capture the rest of the line after "by"
        # (the earlier lazy pattern `.+?` matched only a single character)
        channel = re.search(r"\bby\s+(.+)", text)
        data.append([timestamp, title, channel.group(1).strip() if channel else "Unknown", url])

# Step 3: Create DataFrame
columns = ["Timestamp", "Title", "Channel Name", "URL"]
youtube_df = pd.DataFrame(data, columns=columns)

# Step 4: Add Derived Features
youtube_df["Hour"] = youtube_df["Timestamp"].dt.hour
youtube_df["Day"] = youtube_df["Timestamp"].dt.day_name()
youtube_df["Month"] = youtube_df["Timestamp"].dt.month_name()

# Validate the DataFrame
print("\nDataset Preview:")
print(youtube_df.head(5))

print("\nDataset Summary:")
youtube_df.info()  # info() prints its report directly and returns None

print("\nCheck for Missing Values:")
print(youtube_df.isnull().sum())

# Check for duplicates
duplicate_rows = youtube_df.duplicated().sum()
print(f"\nNumber of duplicate rows: {duplicate_rows}")

if duplicate_rows > 0:
    youtube_df.drop_duplicates(inplace=True)
    print(f"Duplicates removed. Remaining rows: {len(youtube_df)}")


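# Step 5: Save Cleaned Data
# Write to the processed-data folder; "notebook/data/" is never created by the setup cell
import os
os.makedirs("data/processed", exist_ok=True)
youtube_df.to_csv("data/processed/youtube_cleaned.csv", index=False)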

print("Data extraction and cleaning complete.")


KeyboardInterrupt: 
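
The timestamp pattern from Step 2 can be checked in isolation before running it over the full export. This is a small sketch under stated assumptions: the sample strings are invented, the helper name `parse_timestamp` is hypothetical, and `datetime.strptime` stands in for `pd.to_datetime` with the same format string so the check runs without pandas.

```python
import re
from datetime import datetime

# Same pattern and format string as the extraction loop in Step 2
TIMESTAMP_PATTERN = re.compile(r"(\d{1,2} \w+ \d{4}, \d{1,2}:\d{2} [AP]M)")
TIMESTAMP_FORMAT = "%d %B %Y, %I:%M %p"


def parse_timestamp(text):
    """Return the first matching timestamp in `text`, or None if absent."""
    match = TIMESTAMP_PATTERN.search(text)
    if match is None:
        return None
    return datetime.strptime(match.group(1), TIMESTAMP_FORMAT)


# Invented cell text in the layout the pattern expects
matching_text = "Watched Some Video\nSome Channel\n14 February 2023, 3:45 PM"
# Invented text in a different date layout, which the pattern does not cover
other_layout = "Feb 14, 2023, 3:45:12 PM"

print(parse_timestamp(matching_text))
print(parse_timestamp(other_layout))
```

Entries whose date does not match the pattern return `None` here; in the extraction loop the same entries are skipped without a warning, so it is worth comparing the number of parsed rows against the number of cells in the export.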