# Movie User Segmentation with K-Means Clustering
## A Machine Learning Project Using Clustering Analysis

**Project Overview:**
- Dataset: MovieLens `ml-latest-small` (about 100K ratings)
- Goal: Segment users into groups based on rating behavior
- Method: K-Means Clustering
- Focus: User behavior analysis and segmentation

---

## 1. Setup and Imports

**What does this step do?**

This step prepares our workspace by importing the Python libraries the project needs:
- **Data handling**: `pandas` (for working with data tables) and `numpy` (for mathematical operations)
- **Visualization**: `matplotlib` and `seaborn` (for creating charts and graphs)
- **Machine Learning**: `KMeans` (for grouping users into clusters) and `StandardScaler` (for normalizing data)
- **Logging**: To keep track of what our code is doing and save results to a log file

We also fix NumPy's random seed (42), and the K-Means calls later pass `random_state=42`, so results are reproducible from run to run.
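The output later in this notebook shows a `UnicodeEncodeError` when a check mark is logged through a cp1252-encoded stream. One possible remedy, sketched below, is to open the log file as UTF-8. The logger name `utf8_demo` and the temporary log path are hypothetical and exist only for this illustration. Whether the file handler is the failing handler in the original run is an assumption.

```python
import logging
import os
import tempfile

# Hypothetical log location for this sketch only.
log_path = os.path.join(tempfile.mkdtemp(), "demo.log")

demo_logger = logging.getLogger("utf8_demo")  # hypothetical logger name
demo_logger.setLevel(logging.INFO)
demo_logger.propagate = False  # keep the demo isolated from the root logger

# encoding="utf-8" lets the handler write characters that cp1252 cannot encode.
handler = logging.FileHandler(log_path, encoding="utf-8")
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
demo_logger.addHandler(handler)

demo_logger.info("\u2713 libraries imported")  # \u2713 is the check mark

# Flush and detach the handler before reading the file back.
handler.close()
demo_logger.removeHandler(handler)

with open(log_path, encoding="utf-8") as f:
    logged = f.read()

# ascii() escapes the check mark so the print is safe on any console encoding.
print(ascii(logged))
```

Keeping log messages plain ASCII is another way to sidestep the problem on consoles that cannot display the character.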

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import logging
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Machine Learning libraries
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Set random seed for reproducibility
np.random.seed(42)

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s',
    handlers=[
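        # Write the log file as UTF-8 so non-ASCII characters (such as the check mark) can be encoded
        logging.FileHandler('movie_clustering_project.log', encoding='utf-8'),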
        logging.StreamHandler()
    ]
)

logger = logging.getLogger(__name__)
logger.info('='*80)
logger.info('MOVIE USER SEGMENTATION PROJECT STARTED')
logger.info(f'Start Time: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}')
logger.info('='*80)

# Set plot style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')

print('✓ All libraries imported successfully!')
logger.info('All libraries imported successfully')

## 2. Data Loading and Exploration

**What does this step do?**

Here we load the MovieLens dataset, which contains:
- **Ratings data**: Information about which users rated which movies and what rating they gave
- **Movies data**: Information about movie titles and genres

We then explore the data to understand:
- How many users, movies, and ratings are in our dataset
- The range of ratings (0.5 to 5.0)
- How "sparse" the data is (meaning most users haven't rated most movies)

**Why is this important?**
Understanding our data helps us know what we're working with and identify any potential issues before we start analysis.
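To make the sparsity figure concrete, here is a minimal sketch on a made-up ratings table that uses the same column names as `ratings.csv`. The values and the `toy_` variable names are invented for illustration.

```python
import pandas as pd

# Toy ratings table (made-up values, same column names as ratings.csv).
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 3, 3, 3],
    "movieId": [10, 20, 10, 20, 30, 40],
    "rating":  [4.0, 3.5, 5.0, 2.0, 4.5, 3.0],
})

toy_users = toy_ratings["userId"].nunique()    # distinct users
toy_movies = toy_ratings["movieId"].nunique()  # distinct movies
toy_count = len(toy_ratings)                   # observed ratings

# Share of the user x movie grid that has no rating, as a percentage.
toy_sparsity = (1 - toy_count / (toy_users * toy_movies)) * 100
print(f"{toy_sparsity:.1f}% of user-movie pairs are unrated")
```

Here 6 of the 12 possible user-movie pairs are filled, so half the grid is empty. Applying the same formula to the figures reported for this dataset (610 users, 9,724 movies, 100,836 ratings) gives a sparsity of about 98.3%.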

In [5]:
logger.info('\n' + '='*80)
logger.info('PHASE 1: DATA LOADING AND EXPLORATION')
logger.info('='*80)

# Load datasets
try:
    ratings = pd.read_csv('ml-latest-small/ratings.csv')
    movies = pd.read_csv('ml-latest-small/movies.csv')
    logger.info(f'✓ Ratings dataset loaded: {ratings.shape[0]} rows, {ratings.shape[1]} columns')
    logger.info(f'✓ Movies dataset loaded: {movies.shape[0]} rows, {movies.shape[1]} columns')
    print(f'✓ Loaded {ratings.shape[0]:,} ratings for {movies.shape[0]:,} movies')
except Exception as e:
    logger.error(f'Error loading data: {str(e)}')
    raise

# Display first few rows
print('\n--- Ratings Data Sample ---')
print(ratings.head())
logger.info(f'Ratings columns: {list(ratings.columns)}')

print('\n--- Movies Data Sample ---')
print(movies.head())
logger.info(f'Movies columns: {list(movies.columns)}')

# Basic statistics
print('\n--- Dataset Statistics ---')
n_users = ratings['userId'].nunique()
n_movies = ratings['movieId'].nunique()
n_ratings = len(ratings)
sparsity = (1 - n_ratings / (n_users * n_movies)) * 100

print(f'Total Users: {n_users:,}')
print(f'Total Movies: {n_movies:,}')
print(f'Total Ratings: {n_ratings:,}')
print(f'Sparsity: {sparsity:.2f}%')
print(f'Rating Range: {ratings["rating"].min()} - {ratings["rating"].max()}')
print(f'Average Rating: {ratings["rating"].mean():.2f}')

logger.info(f'Total Users: {n_users}')
logger.info(f'Total Movies: {n_movies}')
logger.info(f'Total Ratings: {n_ratings}')
logger.info(f'Sparsity: {sparsity:.2f}%')
logger.info(f'Average Rating: {ratings["rating"].mean():.2f}')

2025-11-14 18:55:04,753 - INFO - 
2025-11-14 18:55:04,754 - INFO - PHASE 1: DATA LOADING AND EXPLORATION
--- Logging error ---
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.2800.0_x64__qbz5n2kfra8p0\Lib\logging\__init__.py", line 1163, in emit
    stream.write(msg + self.terminator)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.2800.0_x64__qbz5n2kfra8p0\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u2713' in position 33: character maps to <undefined>
Call stack:
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\cakypro\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Pytho

✓ Loaded 100,836 ratings for 9,742 movies

--- Ratings Data Sample ---
   userId  movieId  rating  timestamp
0       1        1     4.0  964982703
1       1        3     4.0  964981247
2       1        6     4.0  964982224
3       1       47     5.0  964983815
4       1       50     5.0  964982931

--- Movies Data Sample ---
   movieId                               title  \
0        1                    Toy Story (1995)   
1        2                      Jumanji (1995)   
2        3             Grumpier Old Men (1995)   
3        4            Waiting to Exhale (1995)   
4        5  Father of the Bride Part II (1995)   

                                        genres  
0  Adventure|Animation|Children|Comedy|Fantasy  
1                   Adventure|Children|Fantasy  
2                               Comedy|Romance  
3                         Comedy|Drama|Romance  
4                                       Comedy  

--- Dataset Statistics ---
Total Users: 610
Total Movies: 9,724
Total Ratin

## 3. Data Visualization

**What does this step do?**

We create four different charts to visually explore our data:

1. **Rating Distribution**: Shows how many ratings fall at each value (0.5 to 5.0 stars, in half-star steps)
2. **Ratings per User**: Shows how active users are - some rate many movies, others rate just a few
3. **Ratings per Movie**: Shows how popular movies are - some movies get many ratings, others get few
4. **Average Rating per User**: Shows whether users tend to be generous raters or critical raters

**Why is this important?**
Visualizations help us quickly spot patterns in the data that would be hard to see in raw numbers. For example, we can see if most users give high ratings or if there are "super users" who rate hundreds of movies.
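A numeric summary can back up what the histograms suggest. The sketch below uses made-up data in which one user is far more active than the rest. It compares the mean and median number of ratings per user and computes the share of all ratings contributed by the most active user. The `toy_` names are hypothetical.

```python
import pandas as pd

# Made-up data: user 1 rates 8 movies, the other users rate 1 or 2.
toy_activity = pd.DataFrame({
    "userId": [1] * 8 + [2] * 2 + [3] + [4],
    "rating": [4.0, 3.5, 5.0, 2.0, 4.5, 3.0, 4.0, 3.5, 5.0, 4.5, 1.0, 3.0],
})

toy_per_user = toy_activity.groupby("userId").size()

# A mean well above the median points to a right-skewed (long-tailed) distribution.
print(toy_per_user.describe())

# Share of all ratings that comes from the single most active user.
toy_top_share = toy_per_user.max() / toy_per_user.sum()
print(f"Most active user contributes {toy_top_share:.0%} of ratings")
```

If the real `ratings_per_user` series shows a similar gap between mean and median, a log-scaled axis may make the histogram easier to read.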

In [None]:
logger.info('\nCreating initial data visualizations...')

fig, axes = plt.subplots(2, 2, figsize=(15, 10))

# Rating distribution
axes[0, 0].hist(ratings['rating'], bins=10, edgecolor='black', color='steelblue')
axes[0, 0].set_title('Distribution of Ratings', fontsize=12, fontweight='bold')
axes[0, 0].set_xlabel('Rating')
axes[0, 0].set_ylabel('Count')
axes[0, 0].grid(True, alpha=0.3)
logger.info('✓ Rating distribution plot created')

# Ratings per user
ratings_per_user = ratings.groupby('userId').size()
axes[0, 1].hist(ratings_per_user, bins=50, edgecolor='black', color='coral')
axes[0, 1].set_title('Ratings per User', fontsize=12, fontweight='bold')
axes[0, 1].set_xlabel('Number of Ratings')
axes[0, 1].set_ylabel('Number of Users')
axes[0, 1].grid(True, alpha=0.3)
logger.info(f'✓ Ratings per user: min={ratings_per_user.min()}, max={ratings_per_user.max()}, mean={ratings_per_user.mean():.2f}')

# Ratings per movie
ratings_per_movie = ratings.groupby('movieId').size()
axes[1, 0].hist(ratings_per_movie, bins=50, edgecolor='black', color='lightgreen')
axes[1, 0].set_title('Ratings per Movie', fontsize=12, fontweight='bold')
axes[1, 0].set_xlabel('Number of Ratings')
axes[1, 0].set_ylabel('Number of Movies')
axes[1, 0].grid(True, alpha=0.3)
logger.info(f'✓ Ratings per movie: min={ratings_per_movie.min()}, max={ratings_per_movie.max()}, mean={ratings_per_movie.mean():.2f}')

# Average rating by user
avg_rating_per_user = ratings.groupby('userId')['rating'].mean()
axes[1, 1].hist(avg_rating_per_user, bins=30, edgecolor='black', color='plum')
axes[1, 1].set_title('Average Rating per User', fontsize=12, fontweight='bold')
axes[1, 1].set_xlabel('Average Rating')
axes[1, 1].set_ylabel('Number of Users')
axes[1, 1].grid(True, alpha=0.3)
logger.info(f'✓ Average rating per user: min={avg_rating_per_user.min():.2f}, max={avg_rating_per_user.max():.2f}, mean={avg_rating_per_user.mean():.2f}')

plt.tight_layout()
plt.savefig('data_exploration.png', dpi=300, bbox_inches='tight')
plt.show()
logger.info('✓ Data exploration plots saved as data_exploration.png')

print('\n✓ Data exploration complete!')

## 4. Feature Engineering for User Segmentation

**What does this step do?**

We create a summary profile for each user based on their rating behavior. For each user, we calculate:

- **num_ratings**: How many movies they've rated (activity level)
- **avg_rating**: Their average rating score (are they generous or critical?)
- **std_rating**: How much their ratings vary (do they give everything 5 stars, or do they vary?)
- **min_rating/max_rating**: The lowest and highest ratings they've given
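- **rating_range**: The difference between their highest and lowest rating
- **num_movies**: How many distinct movies they've rated (equal to num_ratings for the users in the sample output)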

**Why is this important?**
These features (characteristics) help us understand user behavior. Two users might both have rated 100 movies, but one might give everything 5 stars while the other varies between 1-5 stars. These patterns will help us group similar users together.
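The example in the paragraph above can be reproduced on a toy table. This sketch uses pandas named aggregation, which is a stylistic alternative to the dict-based `agg` call in the notebook and not the notebook's own code. The data and the name `toy_profiles` are invented.

```python
import pandas as pd

# Made-up ratings: user 1 gives everything 5 stars, user 2 varies, user 3 rated once.
toy = pd.DataFrame({
    "userId":  [1, 1, 1, 2, 2, 3],
    "movieId": [10, 20, 30, 10, 40, 20],
    "rating":  [5.0, 5.0, 5.0, 1.0, 5.0, 3.0],
})

# Named aggregation yields flat, readable column names directly.
toy_profiles = toy.groupby("userId").agg(
    num_ratings=("rating", "count"),
    avg_rating=("rating", "mean"),
    std_rating=("rating", "std"),
    min_rating=("rating", "min"),
    max_rating=("rating", "max"),
).reset_index()

# The sample std of a single rating is undefined (NaN); treat it as 0.
toy_profiles["std_rating"] = toy_profiles["std_rating"].fillna(0)
toy_profiles["rating_range"] = toy_profiles["max_rating"] - toy_profiles["min_rating"]

print(toy_profiles)
```

User 1 and user 3 both end up with a spread of zero, but for different reasons: one is consistent, the other has only one rating. Activity level (`num_ratings`) is what tells them apart.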

In [8]:
logger.info('\n' + '='*80)
logger.info('PHASE 2: FEATURE ENGINEERING FOR USER SEGMENTATION')
logger.info('='*80)

# Create user features for clustering
logger.info('Creating user features...')

user_features = ratings.groupby('userId').agg({
    'rating': ['count', 'mean', 'std', 'min', 'max'],
    'movieId': 'nunique'
}).reset_index()

user_features.columns = ['userId', 'num_ratings', 'avg_rating', 'std_rating', 'min_rating', 'max_rating', 'num_movies']
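# A user with a single rating would have an undefined sample std (NaN); treat it as 0.
# Assigning the result back avoids the chained in-place fillna, which newer pandas versions warn about.
user_features['std_rating'] = user_features['std_rating'].fillna(0)

# Calculate rating range (highest minus lowest rating)
user_features['rating_range'] = user_features['max_rating'] - user_features['min_rating']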

logger.info(f'✓ User features created: {user_features.shape}')
logger.info(f'Features: {list(user_features.columns)}')

print('\n--- User Features Sample ---')
print(user_features.head())
print('\n--- User Features Statistics ---')
print(user_features.describe())

logger.info('\nUser feature statistics:')
for col in ['num_ratings', 'avg_rating', 'std_rating']:
    logger.info(f'{col}: mean={user_features[col].mean():.2f}, std={user_features[col].std():.2f}')

2025-11-14 18:55:22,835 - INFO - 
2025-11-14 18:55:22,836 - INFO - PHASE 2: FEATURE ENGINEERING FOR USER SEGMENTATION
2025-11-14 18:55:22,836 - INFO - Creating user features...
--- Logging error ---
Traceback (most recent call last):
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.2800.0_x64__qbz5n2kfra8p0\Lib\logging\__init__.py", line 1163, in emit
    stream.write(msg + self.terminator)
  File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.2800.0_x64__qbz5n2kfra8p0\Lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u2713' in position 33: character maps to <undefined>
Call stack:
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "C:\Users\cakypro\AppData\Local\Packages\PythonSoft


--- User Features Sample ---
   userId  num_ratings  avg_rating  std_rating  min_rating  max_rating  \
0       1          232    4.366379    0.800048         1.0         5.0   
1       2           29    3.948276    0.805615         2.0         5.0   
2       3           39    2.435897    2.090642         0.5         5.0   
3       4          216    3.555556    1.314204         1.0         5.0   
4       5           44    3.636364    0.990441         1.0         5.0   

   num_movies  rating_range  
0         232           4.0  
1          29           3.0  
2          39           4.5  
3         216           4.0  
4          44           4.0  

--- User Features Statistics ---
           userId  num_ratings  avg_rating  std_rating  min_rating  \
count  610.000000   610.000000  610.000000  610.000000  610.000000   
mean   305.500000   165.304918    3.657222    0.927116    1.314754   
std    176.236111   269.480584    0.480635    0.266108    0.835449   
min      1.000000    20.000000 

## 5. User Segmentation with K-Means Clustering

**What does this step do?**

This is the core of our project! We use an algorithm called **K-Means Clustering** to automatically group users into segments based on their behavior.

**The process:**
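1. **Standardize features**: We rescale each feature to mean 0 and standard deviation 1 so that a large-valued feature such as the number of ratings doesn't dominate the distance calculation
2. **Elbow Method**: We fit K-Means for K = 2 to 10 and plot the within-cluster sum of squares (WCSS) to look for the point where adding clusters stops helping much
3. **Apply K-Means**: We fit the final model with K = 4 (the value set in the code after inspecting the elbow plot) and assign each user to the nearest cluster center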
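To see what the standardization step does, the sketch below applies `StandardScaler` to a few made-up user profiles and checks the result against a manual z-score. The array values and the names `X_toy`, `Z_toy`, and `Z_manual` are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up user profiles: columns are num_ratings, avg_rating, std_rating.
X_toy = np.array([
    [20.0, 4.5, 0.5],
    [150.0, 3.5, 1.0],
    [1200.0, 3.0, 1.2],
    [60.0, 2.5, 0.9],
])

Z_toy = StandardScaler().fit_transform(X_toy)

# Manual z-score per column, using the population std (ddof=0) as StandardScaler does.
Z_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(Z_toy, Z_manual))
print(Z_toy.mean(axis=0).round(6), Z_toy.std(axis=0).round(6))
```

Before scaling, `num_ratings` spans more than a thousand units while `std_rating` spans less than one, so Euclidean distances would be driven almost entirely by activity level. After scaling, each column has the same spread.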

**The 3 features we use for clustering:**
- Number of ratings (activity level)
- Average rating (sentiment)
- Standard deviation of ratings (consistency)

**Why is this important?**
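Instead of treating all users the same, clustering helps us understand that different users have different behaviors. For example, the four segments might look like this (illustrative only; the code labels clusters 0-3, and what each one means depends on the fitted result):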
- Cluster 1: Casual users who rate few movies but are positive
- Cluster 2: Active users who rate many movies with varied opinions
- Cluster 3: Critical users who tend to give lower ratings
- Cluster 4: Super-users who rate hundreds of movies
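Reading an elbow plot involves some judgment, so a second criterion can help when choosing K. The sketch below computes the silhouette score for several values of K. This is an addition and not part of the original notebook. It runs on synthetic blobs from `make_blobs` as a stand-in for the scaled user features; the blob centers and variable names are made up.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for scaled user features: four well-separated groups.
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 0], [0, 6], [6, 6]],
    cluster_std=0.5,
    random_state=42,
)

# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
sil_scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    sil_scores[k] = silhouette_score(X_demo, labels)

best_k = max(sil_scores, key=sil_scores.get)
for k, score in sil_scores.items():
    print(f"K={k}: silhouette={score:.3f}")
print(f"Highest silhouette at K={best_k}")
```

On real user data the groups are unlikely to be this cleanly separated, so the silhouette curve may be flatter and may not agree with the elbow plot. In that case the choice of K remains a judgment call.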

In [None]:
logger.info('\n' + '='*80)
logger.info('PHASE 3: USER SEGMENTATION WITH K-MEANS CLUSTERING')
logger.info('='*80)

# Select features for clustering
clustering_features = ['num_ratings', 'avg_rating', 'std_rating']
X_cluster = user_features[clustering_features].copy()

logger.info(f'Clustering features: {clustering_features}')
logger.info(f'Feature matrix shape: {X_cluster.shape}')

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_cluster)
logger.info('✓ Features standardized (mean=0, std=1)')
logger.info(f'Scaled data - mean: {X_scaled.mean(axis=0)}, std: {X_scaled.std(axis=0)}')

# Elbow method to find optimal K
logger.info('\nPerforming Elbow Method to find optimal K...')
wcss = []
K_range = range(2, 11)

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_scaled)
    wcss.append(kmeans.inertia_)
    logger.info(f'K={k}: WCSS={kmeans.inertia_:.2f}')

# Plot elbow curve
plt.figure(figsize=(10, 6))
plt.plot(K_range, wcss, 'bo-', linewidth=2, markersize=8)
plt.xlabel('Number of Clusters (K)', fontsize=12)
plt.ylabel('Within-Cluster Sum of Squares (WCSS)', fontsize=12)
plt.title('Elbow Method for Optimal K', fontsize=14, fontweight='bold')
plt.grid(True, alpha=0.3)
plt.savefig('elbow_method.png', dpi=300, bbox_inches='tight')
plt.show()
logger.info('✓ Elbow curve saved as elbow_method.png')

# Apply K-Means with the chosen K (set by hand after inspecting the elbow plot above)
optimal_k = 4
logger.info(f'\nApplying K-Means with optimal K={optimal_k}')

kmeans_final = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
user_features['cluster'] = kmeans_final.fit_predict(X_scaled)

logger.info('✓ K-Means clustering complete')
logger.info(f'Inertia (WCSS): {kmeans_final.inertia_:.2f}')
logger.info(f'Number of iterations: {kmeans_final.n_iter_}')

# Cluster distribution
cluster_counts = user_features['cluster'].value_counts().sort_index()
print('\n--- Cluster Distribution ---')
print(cluster_counts)
logger.info('\nCluster distribution:')
for cluster, count in cluster_counts.items():
    logger.info(f'Cluster {cluster}: {count} users ({count/len(user_features)*100:.1f}%)')

# Cluster characteristics
print('\n--- Cluster Characteristics ---')
cluster_summary = user_features.groupby('cluster')[clustering_features].mean()
print(cluster_summary)

logger.info('\nCluster characteristics (mean values):')
for idx, row in cluster_summary.iterrows():
    logger.info(f'Cluster {idx}: num_ratings={row["num_ratings"]:.1f}, avg_rating={row["avg_rating"]:.2f}, std_rating={row["std_rating"]:.2f}')

print('\n✓ User segmentation complete!')

## 6. Visualize Clusters and Summary

**What does this step do?**

We create visualizations and interpretations of our clustering results:

**Visualizations:**
1. **User Clusters: Activity vs Rating Behavior**: A scatter plot showing how user activity relates to their average ratings, with each cluster shown in a different color. Black X marks show the center of each cluster.
2. **User Clusters: Rating Patterns**: Shows the relationship between average rating and rating consistency (standard deviation)
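Because K-Means is fitted on standardized features, its cluster centers live in scaled units and have to be mapped back before they can be drawn on axes in original units. The sketch below shows that mapping with `inverse_transform` on made-up data with two features. The names `X_small`, `demo_scaler`, `km_demo`, and `demo_centroids` are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up profiles: columns are num_ratings, avg_rating.
X_small = np.array([
    [20.0, 4.5], [30.0, 4.4], [25.0, 4.6],        # light, positive raters
    [900.0, 3.0], [1000.0, 3.1], [950.0, 2.9],    # heavy, more critical raters
])

demo_scaler = StandardScaler()
Z_small = demo_scaler.fit_transform(X_small)

km_demo = KMeans(n_clusters=2, random_state=42, n_init=10).fit(Z_small)

# cluster_centers_ is in scaled units; map it back to ratings counts and star values.
demo_centroids = demo_scaler.inverse_transform(km_demo.cluster_centers_)
print(demo_centroids)

# Each mapped-back centroid matches the mean of its members in original units.
for c in range(2):
    members = X_small[km_demo.labels_ == c]
    print(c, members.mean(axis=0))
```

Which group receives label 0 and which receives label 1 is arbitrary, so downstream code should identify clusters by their centroid values, not by label number.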

**Cluster Interpretation:**
For each cluster, we automatically interpret the user behavior patterns by looking at:
- **Activity Level**: High vs Low (the cluster's mean number of ratings compared with the median across all users)
- **Sentiment**: Positive (>3.5), Critical (<3.0), or Moderate (3.0-3.5)
- **Consistency**: Varied Tastes (high std dev) vs Consistent (low std dev)
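The three rules above can be collected into one small function, which makes the thresholds easy to test and adjust. This is a sketch that mirrors the rules as described. The function name `describe_cluster` and its parameter names are hypothetical, and the example numbers are made up.

```python
def describe_cluster(mean_num_ratings, mean_rating, mean_std,
                     median_num_ratings, median_std):
    """Turn a cluster's mean feature values into a short text label."""
    # Activity: cluster mean compared with the median over all users.
    activity = "High Activity" if mean_num_ratings > median_num_ratings else "Low Activity"

    # Sentiment: fixed thresholds on the cluster's mean rating.
    if mean_rating > 3.5:
        sentiment = "Positive"
    elif mean_rating < 3.0:
        sentiment = "Critical"
    else:
        sentiment = "Moderate"

    # Consistency: cluster mean std compared with the median std over all users.
    consistency = "Varied Tastes" if mean_std > median_std else "Consistent"

    return f"{activity}, {sentiment}, {consistency}"


# Made-up cluster means, with made-up overall medians of 70 ratings and 0.9 std.
label_a = describe_cluster(300, 3.8, 1.1, median_num_ratings=70, median_std=0.9)
label_b = describe_cluster(30, 2.7, 0.6, median_num_ratings=70, median_std=0.9)
print(label_a)
print(label_b)
```

Note that a mean rating of exactly 3.5 or 3.0 falls into the Moderate band, since both comparisons are strict.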

**Project Summary:**
Finally, we display a complete summary of:
- Dataset statistics
- Clustering results
- Key findings
- Output files generated

**Why is this important?**
Visualizations make complex clustering results easy to understand at a glance. The interpretations help us understand what each cluster represents in real-world terms. For example, "Cluster 0: High Activity, Positive, Varied Tastes" tells us these are users who rate many movies, generally like them, but have diverse opinions.

In [None]:
logger.info('\nCreating cluster visualizations...')

fig, axes = plt.subplots(1, 2, figsize=(15, 6))

# Scatter plot: Number of ratings vs Average rating
for cluster in range(optimal_k):
    cluster_data = user_features[user_features['cluster'] == cluster]
    axes[0].scatter(cluster_data['num_ratings'], cluster_data['avg_rating'], 
                   label=f'Cluster {cluster}', alpha=0.6, s=50, edgecolors='k', linewidth=0.5)

# Plot centroids
centroids_original = scaler.inverse_transform(kmeans_final.cluster_centers_)
axes[0].scatter(centroids_original[:, 0], centroids_original[:, 1], 
               c='black', marker='X', s=300, edgecolors='yellow', linewidth=2, label='Centroids')
axes[0].set_xlabel('Number of Ratings', fontsize=12)
axes[0].set_ylabel('Average Rating', fontsize=12)
axes[0].set_title('User Clusters: Activity vs Rating Behavior', fontsize=12, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Scatter plot: Average rating vs Std rating
for cluster in range(optimal_k):
    cluster_data = user_features[user_features['cluster'] == cluster]
    axes[1].scatter(cluster_data['avg_rating'], cluster_data['std_rating'], 
                   label=f'Cluster {cluster}', alpha=0.6, s=50, edgecolors='k', linewidth=0.5)

axes[1].set_xlabel('Average Rating', fontsize=12)
axes[1].set_ylabel('Rating Std Dev', fontsize=12)
axes[1].set_title('User Clusters: Rating Patterns', fontsize=12, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('cluster_visualization.png', dpi=300, bbox_inches='tight')
plt.show()
logger.info('✓ Cluster visualizations saved as cluster_visualization.png')

# Interpret clusters
print('\n--- Cluster Interpretation ---')
logger.info('\nCluster interpretation:')
for cluster in range(optimal_k):
    cluster_data = user_features[user_features['cluster'] == cluster]
    avg_num_ratings = cluster_data['num_ratings'].mean()
    avg_rating = cluster_data['avg_rating'].mean()
    avg_std = cluster_data['std_rating'].mean()
    
    if avg_num_ratings > user_features['num_ratings'].median():
        activity = 'High Activity'
    else:
        activity = 'Low Activity'
    
    if avg_rating > 3.5:
        sentiment = 'Positive'
    elif avg_rating < 3.0:
        sentiment = 'Critical'
    else:
        sentiment = 'Moderate'
    
    if avg_std > user_features['std_rating'].median():
        consistency = 'Varied Tastes'
    else:
        consistency = 'Consistent'
    
    interpretation = f'{activity}, {sentiment}, {consistency}'
    print(f'Cluster {cluster}: {interpretation}')
    print(f'  - Avg ratings given: {avg_num_ratings:.1f}')
    print(f'  - Avg rating value: {avg_rating:.2f}')
    print(f'  - Avg std dev: {avg_std:.2f}\n')
    
    logger.info(f'Cluster {cluster}: {interpretation} (n={len(cluster_data)}, avg_ratings={avg_num_ratings:.1f}, avg_val={avg_rating:.2f}, std={avg_std:.2f})')

# Project Summary
logger.info('\n' + '='*80)
logger.info('PROJECT SUMMARY')
logger.info('='*80)

summary = f"""
========================================
PROJECT SUMMARY
========================================

1. DATASET:
   - Total Ratings: {n_ratings:,}
   - Total Users: {n_users:,}
   - Total Movies: {n_movies:,}
   - Sparsity: {sparsity:.2f}%

2. USER SEGMENTATION (K-Means Clustering):
   - Number of Clusters: {optimal_k}
   - Features Used: num_ratings, avg_rating, std_rating
   - Users successfully segmented into {optimal_k} distinct behavioral groups

3. KEY FINDINGS:
   - Users can be effectively segmented based on rating behavior
   - {optimal_k} distinct user segments identified with unique characteristics
   - Clustering reveals patterns in user activity and rating preferences

4. OUTPUTS GENERATED:
   - data_exploration.png
   - elbow_method.png
   - cluster_visualization.png
   - movie_clustering_project.log

========================================
"""

print(summary)
logger.info(summary)

logger.info('\n' + '='*80)
logger.info('PROJECT COMPLETED SUCCESSFULLY')
logger.info(f'End Time: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}')
logger.info('='*80)

print('\n' + '='*80)
print('✓ PROJECT COMPLETE!')
print('='*80)