<h1 style="text-align:center; font-size:35px; color:black;">Web Traffic Analysis: Understanding User Interaction </h1>

# 🔬 Project Summary

This project analyzes website traffic data to understand how users interact with different pages over a 7-day period (2021-08-19 to 2021-08-25). The dataset records three event types (pageviews, previews, and clicks), along with the country and city each event came from and identifiers for the page content (`isrc`) and the link (`linkid`).

### This notebook is for data cleaning and preparation. 

It gets the raw dataset ready for analysis by:

* Checking for missing values and filling them where appropriate.
* Checking for duplicate rows and deciding whether to keep them.
* Validating categorical columns such as `event`.
* Saving a clean dataset (`website_traffic_cleaned.csv`) for use in the analysis notebook.

In [1]:
import pandas as pd

In [2]:
# Load dataset
traffic = pd.read_csv("website_traffic.csv")

## 🧹 Data Cleaning and Exploration

* ### Basic structure & unique event types


In [3]:
print(traffic.shape)

traffic.head()

(226278, 9)


Unnamed: 0,event,date,country,city,artist,album,track,isrc,linkid
0,click,2021-08-21,Saudi Arabia,Jeddah,Tesher,Jalebi Baby,Jalebi Baby,QZNWQ2070741,2d896d31-97b6-4869-967b-1c5fb9cd4bb8
1,click,2021-08-21,Saudi Arabia,Jeddah,Tesher,Jalebi Baby,Jalebi Baby,QZNWQ2070741,2d896d31-97b6-4869-967b-1c5fb9cd4bb8
2,click,2021-08-21,India,Ludhiana,Reyanna Maria,So Pretty,So Pretty,USUM72100871,23199824-9cf5-4b98-942a-34965c3b0cc2
3,click,2021-08-21,France,Unknown,"Simone & Simaria, Sebastian Yatra",No Llores Más,No Llores Más,BRUM72003904,35573248-4e49-47c7-af80-08a960fa74cd
4,click,2021-08-21,Maldives,Malé,Tesher,Jalebi Baby,Jalebi Baby,QZNWQ2070741,2d896d31-97b6-4869-967b-1c5fb9cd4bb8


In [4]:
traffic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 226278 entries, 0 to 226277
Data columns (total 9 columns):
 #   Column   Non-Null Count   Dtype 
---  ------   --------------   ----- 
 0   event    226278 non-null  object
 1   date     226278 non-null  object
 2   country  226267 non-null  object
 3   city     226267 non-null  object
 4   artist   226241 non-null  object
 5   album    226273 non-null  object
 6   track    226273 non-null  object
 7   isrc     219157 non-null  object
 8   linkid   226278 non-null  object
dtypes: object(9)
memory usage: 15.5+ MB


In [5]:
traffic.nunique()

event          3
date           7
country      211
city       11993
artist      2419
album       3254
track       3562
isrc         709
linkid      3839
dtype: int64

* ### Check for missing values

In [6]:
traffic.isnull().sum()

event         0
date          0
country      11
city         11
artist       37
album         5
track         5
isrc       7121
linkid        0
dtype: int64

In [7]:
# Fill missing values with 'Unknown'
traffic['country'] = traffic['country'].fillna('Unknown')
traffic['city'] = traffic['city'].fillna('Unknown')

traffic['artist'] = traffic['artist'].fillna('Unknown')
traffic['album'] = traffic['album'].fillna('Unknown')
traffic['track'] = traffic['track'].fillna('Unknown')

In [8]:
# check again after filling

traffic.isnull().sum()

event         0
date          0
country       0
city          0
artist        0
album         0
track         0
isrc       7121
linkid        0
dtype: int64
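The five separate `fillna` calls can also be written as one assignment over a list of columns. The sketch below uses a small made-up frame (`sample`) rather than the real data, and `fill_cols` is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the traffic data (hypothetical values, not the real dataset)
sample = pd.DataFrame({
    "country": ["India", np.nan],
    "city": [np.nan, "Ludhiana"],
    "artist": ["Tesher", np.nan],
    "isrc": [np.nan, "QZNWQ2070741"],
})

# Columns where a missing value is replaced by the label 'Unknown'
fill_cols = ["country", "city", "artist"]
sample[fill_cols] = sample[fill_cols].fillna("Unknown")

# isrc is deliberately left out, so its missing value remains
remaining = sample.isnull().sum()
print(remaining)
```

In the notebook, the same pattern would apply to `traffic` with `country`, `city`, `artist`, `album`, and `track` in the list.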

The `isrc` column still has 7,121 missing values, about 3.1% of the 226,278 rows. It matters only for analysis that focuses on individual tracks, so these gaps are left as `NaN` and can be ignored for general link and event analysis.
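The share of missing values can be checked directly rather than estimated. This is a minimal sketch on a made-up frame (`sample`), followed by the same arithmetic applied to the counts reported in this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the traffic data (hypothetical values, not the real dataset)
sample = pd.DataFrame({
    "event": ["click", "preview", "pageview", "click"],
    "isrc": ["QZNWQ2070741", np.nan, "USUM72100871", "BRUM72003904"],
    "linkid": ["a", "b", "c", "d"],
})

# isnull().mean() gives the fraction of missing values per column
missing_pct = (sample.isnull().mean() * 100).round(2)
print(missing_pct)

# The same calculation using the counts reported in this notebook
isrc_missing_pct = round(7121 / 226278 * 100, 2)
print(isrc_missing_pct)
```

On the real data, `traffic.isnull().mean() * 100` would give the percentage for every column at once.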

* ### Check for Duplicate Rows

In [9]:
duplicates = traffic.duplicated().sum()
print("Duplicates:", duplicates)

Duplicates: 103711


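103,711 rows (about 46% of the data) are exact repeats of an earlier row. The dataset records only the date of each event, with no timestamp or user identifier, so identical rows are expected whenever the same link receives the same event type from the same city on the same day. Each row is treated as a separate event, and the duplicates are kept.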
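To see what these repeated rows look like, one option is to count how often each distinct row occurs. The sketch below uses a small made-up frame (`sample`), and the column name `n_events` is introduced here for illustration.

```python
import pandas as pd

# Toy stand-in: three identical click rows and one preview row (hypothetical values)
sample = pd.DataFrame({
    "event": ["click", "click", "click", "preview"],
    "date": ["2021-08-21"] * 4,
    "linkid": ["a", "a", "a", "b"],
})

# duplicated() flags every repeat after the first occurrence of a row
n_repeats = int(sample.duplicated().sum())
print("Repeated rows:", n_repeats)

# Count how many times each distinct row occurs, keeping rows that contain NaN
row_counts = (
    sample.groupby(list(sample.columns), dropna=False)
    .size()
    .reset_index(name="n_events")
    .sort_values("n_events", ascending=False)
)
print(row_counts)
```

Here the three identical click rows collapse into one row with `n_events` equal to 3, which reads naturally as three clicks on the same link on the same day.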

* ### Validate event categories

In [10]:
traffic['event'].unique()

array(['click', 'preview', 'pageview'], dtype=object)
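The visual check above can be turned into an explicit assertion so that an unexpected label fails loudly. This is a sketch on a made-up frame. `EXPECTED_EVENTS` is a name introduced here, and the conversion to a categorical dtype is an optional extra that the notebook itself does not perform.

```python
import pandas as pd

# The three event types observed in this dataset
EXPECTED_EVENTS = {"click", "preview", "pageview"}

# Toy stand-in for the event column (hypothetical values)
sample = pd.DataFrame({"event": ["click", "preview", "pageview", "click"]})

# Any label outside the expected set would end up in this set
unexpected = set(sample["event"].unique()) - EXPECTED_EVENTS
assert not unexpected, f"Unexpected event types: {unexpected}"

# Optional: store the column as a categorical with a fixed set of labels
sample["event"] = pd.Categorical(sample["event"], categories=sorted(EXPECTED_EVENTS))
print(sample["event"].value_counts())
```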

* ### Validate date formatting & time coverage

In [11]:
traffic['date'] = pd.to_datetime(traffic['date'])
print(traffic['date'].min(), traffic['date'].max())

2021-08-19 00:00:00 2021-08-25 00:00:00
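The minimum and maximum confirm the range but not that every day in between has data. One way to check for gaps, sketched on a made-up frame (`sample`), is to compare the observed dates against a full daily range. The explicit `format` argument is an assumption based on the `YYYY-MM-DD` strings shown in the preview.

```python
import pandas as pd

# Toy stand-in covering the same 7 days, with one repeated date (hypothetical values)
sample = pd.DataFrame({
    "date": [
        "2021-08-19", "2021-08-20", "2021-08-21", "2021-08-22",
        "2021-08-23", "2021-08-24", "2021-08-25", "2021-08-21",
    ]
})

# An explicit format makes parsing strict: a malformed date raises an error
sample["date"] = pd.to_datetime(sample["date"], format="%Y-%m-%d")

# Every calendar day between the first and last observed date
expected_days = pd.date_range(sample["date"].min(), sample["date"].max(), freq="D")

# Days in that range with no rows at all
missing_days = expected_days.difference(pd.DatetimeIndex(sample["date"].unique()))

n_days = len(expected_days)
print("Days covered:", n_days, "| Missing days:", len(missing_days))
```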


## Export cleaned dataset

In [12]:
traffic.to_csv('website_traffic_cleaned.csv', index=False)

print("Cleaned dataset saved as 'website_traffic_cleaned.csv'")

Cleaned dataset saved as 'website_traffic_cleaned.csv'
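A quick read-back can confirm that the exported file loads as expected. The sketch below round-trips a small made-up frame (`cleaned`) through an in-memory buffer instead of a file on disk. CSV does not store dtypes, so `parse_dates` is needed to get the `date` column back as datetimes.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for the cleaned data (hypothetical values)
cleaned = pd.DataFrame({
    "event": ["click", "preview"],
    "date": pd.to_datetime(["2021-08-21", "2021-08-22"]),
    "city": ["Jeddah", "Unknown"],
    "isrc": ["QZNWQ2070741", np.nan],
})

# Write to an in-memory buffer, then read it back
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Without parse_dates, the date column would come back as plain strings
reloaded = pd.read_csv(buffer, parse_dates=["date"])

print(reloaded.shape)
print(reloaded.dtypes)
```

The `Unknown` labels survive the round trip as ordinary strings, while the missing `isrc` value is written as an empty field and read back as `NaN`.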


<div style="border-radius: 10px; border: purple solid; padding: 10px; background-color: #; font-size: 100%;">

## Feature Description

| Column Name | Description |
|-------------|-------------|
| `event`     | Type of user interaction recorded for the page. Can be `pageview`, `preview`, or `click`. |
| `date`      | Date when the event occurred. Format is `YYYY-MM-DD`. |
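| `country`   | Country where the event originated. Missing values in the raw data are filled with `Unknown` during cleaning. |
| `city`      | City where the event originated. Missing values in the raw data are filled with `Unknown` during cleaning. |
| `artist`    | Artist associated with the page content. Missing values are filled with `Unknown` during cleaning. |
| `album`     | Album associated with the page content. Missing values are filled with `Unknown` during cleaning. |
| `track`     | Track name or page content title. Missing values are filled with `Unknown` during cleaning. |
| `isrc`      | International Standard Recording Code (ISRC) identifying the track/content. May be `NaN`, including in the cleaned dataset. |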
| `linkid`    | Unique identifier for the page/link where the event occurred. |
