<a href="https://colab.research.google.com/github/sydly148/hackthehood-build/blob/main/%5BBuild%5D_Week_7_Tiger_Teams_Teach_Me_How_to_Predict_III.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 🎵 **Week 7 Tiger Teams: Teach Me How to Predict III**

---

### 🧠 **Overview**
In this assignment, you'll apply what you've learned about **machine learning** using the TikTok Songs 2020 dataset from Kaggle.  
You'll get hands-on experience building, testing, and improving models using **scikit-learn**.  

We'll focus on four core models we've studied so far:  
**Linear Regression**, **Logistic Regression**, **KMeans Clustering**, and **Decision Trees**.  
You're also encouraged to explore other models if you'd like (e.g. Random Forest, SVM).  

By the end of this assignment, you'll:
- Prepare and explore real-world data  
- Train, test, and evaluate ML models  
- Measure model accuracy  
- Suggest ways to improve model performance  

### Reference & AI Usage Policy
You're encouraged to reference online documentation and resources such as:  
- [W3Schools](https://www.w3schools.com/python/)  
- [Stack Overflow](https://stackoverflow.com/questions/tagged/python)  
- Official library docs such as [pandas.pydata.org](https://pandas.pydata.org/docs/)  
- [Scikit-Learn Documentation](https://scikit-learn.org/stable/api/index.html)

You may also use **AI tools** (e.g., ChatGPT, Copilot, or Gemini) **as learning aids**, not as shortcuts to complete the assignment.

If you use AI, you must clearly comment **where and how** you used it. For example:  

```python
# Used ChatGPT to understand how to filter rows with .str.contains()
```



## 🧩 Part 1: Setup & Data Exploration

Run the cells below to import libraries and load the dataset.


In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier
from sklearn.metrics import r2_score, mean_squared_error, accuracy_score, confusion_matrix, classification_report

Download the dataset from [this Kaggle link](https://www.kaggle.com/datasets/sveta151/tiktok-popular-songs-2020?resource=download). You will need to create a Kaggle account if you do not already have one. Then run the cell below and upload the CSV file to load it as a DataFrame.

In [None]:
from google.colab import files
uploaded = files.upload()

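import io
# Read whichever file was uploaded, so a renamed download (e.g. "TikTok_songs_2020 (1).csv") still loads
filename = next(iter(uploaded))
df = pd.read_csv(io.BytesIO(uploaded[filename]))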

In [None]:
df.head()


### 👉 Answer the following questions using the cells below:
1. Display basic info about the dataset (`df.info()`)  
2. Check for missing values (`df.isnull().sum()`)  
3. Generate summary statistics (`df.describe()`)  
4. Look at the column names. Which ones seem most relevant to predicting popularity?  
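
The four exploration steps above can be sketched on a small made-up table before running them on the real data. The DataFrame `toy` and its column names are hypothetical stand-ins, not the actual contents of the Kaggle file. The correlation step at the end is one possible way to approach question 4, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real CSV (values are made up)
toy = pd.DataFrame({
    "danceability": [0.71, 0.55, np.nan, 0.80],
    "energy": [0.62, 0.90, 0.45, 0.70],
    "popularity": [80, 65, 72, np.nan],
})

toy.info()                    # column dtypes and non-null counts
missing = toy.isnull().sum()  # number of missing values per column
summary = toy.describe()      # count, mean, std, min, quartiles, max
print(missing)
print(summary)

# Correlation with the target hints at which numeric columns may be useful.
# select_dtypes keeps only numeric columns, since text columns cannot be correlated.
correlations = toy.select_dtypes("number").corr()["popularity"].drop("popularity")
print(correlations)
```

A strong correlation only suggests a linear relationship. A feature with a weak correlation can still be useful to a non-linear model such as a decision tree.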



## 🎛 Part 2: Data Preparation
For this assignment, we'll try to **predict the popularity** of TikTok songs based on their audio features.  

Let's start by selecting some useful columns.


In [None]:
# Select relevant features
features = ['danceability', 'energy', 'loudness', 'speechiness',
            'acousticness', 'instrumentalness', 'liveness',
            'valence', 'tempo']

target = # TO DO

# Drop missing values
df = # TO DO

# Create feature matrix and target vector
X = # TO DO
y = # TO DO

# Split data into training and testing sets. Use test_size=0.2 (20% of the data) and random_state=42.
X_train, X_test, y_train, y_test = # TO DO


### 👉 Answer the following questions using the cells below:
1. Why is it important to split your data into **train** and **test** sets?  
2. Check the range of `y` (popularity). Are there any outliers?  
3. (Optional) Try normalizing your features using `StandardScaler` or `MinMaxScaler` from `sklearn.preprocessing`.  
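
As a rough sketch of questions 2 and 3, the example below flags outliers with the 1.5 × IQR rule and scales features with `StandardScaler`. It uses randomly generated data, so `X_demo`, `y_demo`, and the other variable names are illustrative only. The IQR rule is one common convention for spotting outliers, chosen here as an assumption. The assignment does not prescribe a method.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales
rng = np.random.default_rng(42)
X_demo = rng.normal(loc=[0.6, 120.0], scale=[0.1, 25.0], size=(100, 2))
y_demo = rng.integers(0, 101, size=100).astype(float)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Flag target values more than 1.5 * IQR outside the middle 50% of the data
q1, q3 = np.percentile(y_tr, [25, 75])
iqr = q3 - q1
outlier_mask = (y_tr < q1 - 1.5 * iqr) | (y_tr > q3 + 1.5 * iqr)
print("target range:", y_tr.min(), "to", y_tr.max())
print("outliers flagged:", int(outlier_mask.sum()))

# Fit the scaler on the training set only, then reuse it on the test set,
# so that no information from the test set leaks into training
scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)
X_te_scaled = scaler.transform(X_te)
```

Fitting the scaler on the training rows alone keeps the test set unseen, which is the same reason the data is split in the first place.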



## 🧮 Part 3: Build and Train Models
You can choose which model(s) you want to use for predicting popularity.  
Try at least **one regression model** and **one classification or clustering model**.  

In class, we've used linear regression, logistic regression, and decision trees. **You cannot use k-means clustering for this part, because Part 4 uses a k-means clustering model.** You are free to use other algorithms that we have not explored if you would like to. Refer back to past assignments if you need to!
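
Classification models such as logistic regression expect a categorical target, while popularity is a number. One possible approach, sketched below with made-up values, is to turn the score into a binary label by comparing it to a threshold. The series `popularity`, the label name `is_hit`, and the choice of the median as the cutoff are all assumptions for illustration.

```python
import pandas as pd

# Made-up popularity scores, not taken from the real dataset
popularity = pd.Series([12, 35, 48, 51, 67, 73, 88, 94])

# Use the median as the cutoff so the two classes are roughly balanced
threshold = popularity.median()

# 1 = more popular than the median song, 0 = otherwise
is_hit = (popularity > threshold).astype(int)
print(threshold)
print(is_hit.tolist())
```

A different cutoff changes what the model is being asked to predict, so it is worth explaining the choice in your answers.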


### 🎯 Model 1

In [None]:
model1 = # TO DO

# TO DO: fit your model on your training data


y_pred = model1.predict() # TO DO


# TO DO: If you chose to do a linear regression or clustering model, visualize it here. Make sure that you have an appropriate title, colors, and axis labels.






Use the cell below to output the accuracy of your first model. Use whatever metrics are appropriate for your chosen algorithm. You may use more than one accuracy metric.

### 🎯 Model 2

In [None]:
model2 = # TO DO

# TO DO: fit your model on your training data


y_pred_2 = model2.predict() # TO DO


# TO DO: If you chose to do a linear regression or clustering model, visualize it here. Make sure that you have an appropriate title, colors, and axis labels.






Use the cell below to output the accuracy of your second model. Use whatever metrics are appropriate for your chosen algorithm. You may use more than one accuracy metric.


### 👉 Answer the following questions using the cells below:
1. Compare the accuracy of your 2 models, and explain why you chose the accuracy metrics that you used.  
2. Which one performs better? Why might that be?  
3. (Optional) Try changing hyperparameters (like `max_depth`) and see how the accuracy changes.  
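
For the optional hyperparameter question, a sweep over `max_depth` might look like the sketch below. It runs on synthetic data (`X_demo` and `y_demo` are made up), so the numbers it prints say nothing about the TikTok dataset. It reports R² on both the training and test sets, plus test RMSE, because comparing train and test scores is a common way to spot overfitting.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: the target depends on the first two features plus noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 1, size=(200, 3))
y_demo = 50 + 30 * X_demo[:, 0] - 20 * X_demo[:, 1] + rng.normal(0, 5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

results = {}
for depth in [1, 3, 5, None]:  # None lets the tree grow until every leaf is pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=42)
    tree.fit(X_tr, y_tr)
    test_pred = tree.predict(X_te)
    train_r2 = r2_score(y_tr, tree.predict(X_tr))
    test_r2 = r2_score(y_te, test_pred)
    rmse = np.sqrt(mean_squared_error(y_te, test_pred))
    results[depth] = (train_r2, test_r2, rmse)
    print(f"max_depth={depth}: train R2={train_r2:.2f}, test R2={test_r2:.2f}, test RMSE={rmse:.2f}")
```

A training score that keeps rising while the test score stalls or drops suggests the deeper tree is memorizing noise instead of learning a pattern.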



## 🎧 Part 4: K-Means Clustering (Group Songs by Features)
Let's see if we can find **natural clusters** among songs.


In [None]:
# TO DO: choose the number of clusters to use for your model
kmeans = KMeans(n_clusters= __, random_state=42)
kmeans.fit(X)

df['Cluster'] = kmeans.labels_

sns.pairplot(df, vars=['danceability', 'energy', 'valence'], hue='Cluster', palette='Set2')
plt.suptitle("K-Means Clustering of TikTok Songs", y=1.02)
plt.show()



### 👉 Answer the following questions using the cell below:
1. How many clusters did you choose, and why?  
2. What do the clusters seem to represent (e.g., energetic songs vs chill songs)?  
3. Try changing `n_clusters`. How does it affect the grouping?  
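
One way to back up a choice of `n_clusters` is to compare inertia and silhouette score across several values of k. The sketch below does this on synthetic blobs with three known centers. The data, the variable names, and the range of k are assumptions for illustration, and real song features will rarely separate this cleanly.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three well-separated groups
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.6,
    random_state=42,
)

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_demo)
    # Inertia always shrinks as k grows; silhouette peaks when clusters are compact and distinct
    scores[k] = (km.inertia_, silhouette_score(X_demo, labels))
    print(f"k={k}: inertia={scores[k][0]:.1f}, silhouette={scores[k][1]:.3f}")

best_k = max(scores, key=lambda k: scores[k][1])
print("k with the highest silhouette score:", best_k)
```

K-means relies on distances, so features with large ranges (such as tempo) can dominate the grouping unless the features are scaled first.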



## 🧠 Part 5: Evaluate & Improve

**Answer the following questions in the cell below:**
1. Which model gave you the highest accuracy or best R² score?  
2. What steps might improve it? (e.g. feature selection, parameter tuning, scaling, new model)  
3. How could you improve the other models?
4. What are some things to be wary of while trying to improve your models?
5. Summarize and explain the results of your work as if you were presenting it to a non-technical audience.
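
One thing to be wary of when tuning is judging every change against the same test set, which slowly fits the model to that particular split. A possible safeguard, sketched below on synthetic data, is k-fold cross-validation with the scaler wrapped in a pipeline. The data and variable names are made up, and 5 folds is an arbitrary but common choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the target depends on the first two of four features
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(150, 4))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=150)

# The pipeline refits the scaler inside each fold, so no fold sees statistics
# computed from the data it is being evaluated on
pipeline = make_pipeline(StandardScaler(), LinearRegression())
cv_scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="r2")

print("R2 per fold:", np.round(cv_scores, 3))
print("mean R2:", round(cv_scores.mean(), 3))
```

A large spread between folds suggests the score depends heavily on which rows land in the test split, which is worth mentioning when you present results.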




## 🏁 Wrap-Up
In this lab, you:
- Practiced building, training, and testing ML models  
- Measured model accuracy and visualized performance  
- Explored both prediction and clustering tasks  
- Considered how to improve your model  

Machine learning isn't just about getting a number. It's about **understanding your data** and **iterating** to make your models smarter.
