<a href="https://colab.research.google.com/github/rawwong/DSI-Movie-Recommendation-System---NSDC-Data-Science-Projects/blob/main/DSI_Movie_Recommendation_System_NSDC_Data_Science_Projects.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<h1 align="center">
    NSDC Data Science Projects
</h1>
  
<h2 align="center">
    June 2025 Data Camp Project: Movie Recommendation System
</h2>
<h3 align="center">
    Name: Rachel Wong
</h3>


---
Reminders:


*   The DSI June 2025 Data Camp is restricted to Columbia DSI Students and will be held in an entirely virtual format starting on Friday, June 27th, 2025.

*   The June 2025 Data Camp is an optional synchronized project meant to support incoming Columbia University DSI students. Students will be completing the project *individually,* but alongside peers and mentors!

*   Look below for updates and important information that will help you build and complete your project.

*   Visit the NSDC's dedicated [Slack Channel](https://join.slack.com/t/nsdcorps/shared_invite/zt-1gkd4ibdz-wGeQOwt3LUVooZDIK4zDPw) (#dsi-movie-recs-project) to stay in touch with the community.

---

### **Please read before you begin your project**

**Instructions: Google Colab Notebooks:**

Google Colab is a free cloud service. It is a hosted Jupyter notebook service that requires no setup to use, while providing free access to computing resources. We will be using Google Colab for this project.

**In order to work within the Google Colab Notebook, please start by clicking on "File" and then "Save a copy in Drive." This will save a copy of the notebook in your personal Google Drive.**

Please rename the file to "Movie Recommendation - Your Full Name - Your Columbia UNI." Once this project is completed, you will be prompted to share your file with the National Student Data Corps (NSDC) Project Leaders and Mentors.

You can now start working on the project. :)

We'll be using Google Colab for this assignment. This is a Python Notebook environment built by Google that's free for everyone and comes with a nice UI out of the box. For a comprehensive guide, see Colab's official guide [here](https://colab.research.google.com/github/prites18/NoteNote/blob/master/Welcome_To_Colaboratory.ipynb).

Google Colab QuickStart Guide:
- Notebooks are made up of cells, which can be either text cells or code cells. Click the +code or +text button at the top to create a new cell.
- Text cells use a format called [Markdown](https://www.markdownguide.org/getting-started/). Cheatsheet is available [here](https://www.markdownguide.org/cheat-sheet/).
- Python code is run/executed in code cells. You can click the play button at the top left of a code block (sometimes hidden in the square brackets) to run the code in that cell. You can also hit shift+enter to run the cell that is currently selected. There is no concurrency since cells run one at a time, but you can queue up multiple cells.
- Each cell will run code individually but memory is shared across a notebook Runtime. You can think of a Runtime as a code session where everything you create and execute is temporarily stored. This means variables and functions are available between cells if you execute one cell before the other (physical ordering of cells does not matter). This also means that if you delete or change the name of something and re-execute the cell, the old data might still exist in the background. If things aren't making sense, you can always click Runtime -> restart runtime to start over.
- Runtimes persist for a short period of time, so you are safe if you lose connection or refresh the page, but Google will shut down a runtime after enough time has passed. Everything that was printed out will remain on the page even if the runtime is disconnected.
- Google's Runtimes come preinstalled with all the core Python libraries (math, random, time, etc.) as well as common data analysis libraries (numpy, pandas, scikit-learn, matplotlib). Simply run `import numpy as np` in a code cell to make it available.

# **Singular Value Decomposition**

---

# Crash Course on Singular Value Decomposition (SVD)

## Introduction

Singular Value Decomposition, or SVD, is a mathematical technique used in many fields such as signal processing, statistics, and machine learning, particularly in the context of recommendation systems. It's a method for decomposing a matrix into three other matrices that reveal its underlying structure. Through this project, we will learn how to build a movie recommendation system using an SVD.

## Basic Concepts

### Matrices
- **Matrix**: A rectangular array of numbers.
- **Dimension of a Matrix**: Given in the form of rows × columns.

### Decomposition
- **Decomposition**: Breaking down a complex matrix into simpler, understandable parts.

## What is SVD?

```
SVD breaks down any given matrix A into three separate matrices named U, Σ and V*
ie. A = UΣV*
```
Where the components are:
```
- A: Original matrix.
- U: Left singular vectors (orthogonal matrix).
- Σ: Diagonal matrix of singular values (non-negative).
- V*: Right singular vectors (conjugate transpose of V, an orthogonal matrix).
```
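
To make the decomposition concrete, here is a minimal sketch using NumPy's `np.linalg.svd`. The matrix values are made up for illustration (think of 4 users rating 3 movies); they are not taken from the project dataset.

```python
import numpy as np

# A small 4x3 matrix with hypothetical values (4 users rating 3 movies).
A = np.array([
    [5.0, 4.0, 1.0],
    [4.0, 5.0, 1.0],
    [1.0, 1.0, 5.0],
    [2.0, 1.0, 4.0],
])

# full_matrices=False returns the compact form:
# U is 4x3 with orthonormal columns, s holds 3 singular values, Vt is 3x3.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# s is the diagonal of Sigma, sorted from largest to smallest.
Sigma = np.diag(s)

# Multiplying the three factors back together recovers A (up to rounding error).
A_rebuilt = U @ Sigma @ Vt

print("singular values:", np.round(s, 3))
print("reconstruction matches A:", np.allclose(A, A_rebuilt))
```

Because the matrix is real-valued, the conjugate transpose V* is simply the ordinary transpose, which NumPy returns directly as `Vt`.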




### How do the predictions work?

1. **Model Training**:
   - The SVD algorithm is first trained on a portion of the dataset, which includes user ratings for various movies.
   - During training, the model learns to associate certain patterns and characteristics of users and movies with specific rating behaviors.

2. **Latent Features Extraction**:
   - SVD decomposes the rating matrix into matrices representing latent features of users and movies.
   - These latent features capture underlying aspects that affect rating behavior but are not explicitly available in the data (like user preferences or movie characteristics).

3. **Making Predictions**:
   - Once the model is trained, it can predict ratings for user-movie pairs where the actual rating is unknown.
   - The prediction is essentially a dot product of the latent features of the user and the movie. It represents the estimated preference of the user for that particular movie based on the learned patterns.

4. **Example of a Prediction**:
   - Suppose we want to predict how user `U` would rate movie `M`.
   - The model uses the latent features it has learned for user `U` and movie `M` to compute a predicted rating.
   - This prediction is a numerical value, typically on the same scale as the original ratings (e.g., 1 to 5).

5. **Application**:
   - These predictions are used to recommend movies to users.
   - For example, the system can recommend movies that have the highest predicted ratings for a particular user.

6. **Handling New Users or Movies (Cold Start Problem)**:
   - One challenge is predicting ratings for new users or movies that have little to no rating history. This is known as the cold start problem.
   - Solutions might involve using content-based approaches or hybrid models that don't rely solely on historical rating data.
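
The prediction and recommendation steps above can be sketched in a few lines of plain Python. The latent feature vectors below are invented for illustration, and the helper names `predict` and `recommend` are hypothetical. The model trained later in this project also adds a global mean and bias terms, which are left out here to keep the dot product visible.

```python
# Hypothetical latent feature vectors with 2 factors each.
user_features = {"U": [1.2, 0.3]}
movie_features = {
    "M1": [2.5, 0.5],
    "M2": [0.4, 3.0],
    "M3": [3.0, 1.0],
}

def predict(user, movie):
    """Dot product of the user's and the movie's latent features."""
    p = user_features[user]
    q = movie_features[movie]
    return sum(pf * qf for pf, qf in zip(p, q))

def recommend(user, seen, top_n=2):
    """Rank the movies the user has not seen by predicted score, highest first."""
    candidates = [m for m in movie_features if m not in seen]
    return sorted(candidates, key=lambda m: predict(user, m), reverse=True)[:top_n]

scores = {movie: predict("U", movie) for movie in movie_features}
print(scores)
print(recommend("U", seen={"M1"}))
```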

## Where do we use SVDs?

### Applications in Recommendation Systems

In recommendation systems, SVD is used to predict unknown preferences by decomposing a large matrix of user-item interactions into factors representing latent features. It helps in capturing the underlying patterns in the data.

### Process

1. **Matrix Creation**: Start with a matrix where rows represent users, columns represent items, and entries represent user ratings.
2. **Apply SVD**: Decompose this matrix using SVD.
3. **Latent Features**: The decomposition reveals latent features that explain observed ratings.
4. **Prediction**: Use the decomposed matrices to predict missing ratings.
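
The four steps above can be sketched with NumPy on a tiny, made-up ratings matrix. One assumption to flag: a classical SVD needs a complete matrix, so this sketch fills each gap with the movie's mean rating before decomposing. That is a simplification chosen for illustration, not the approach the Surprise library takes.

```python
import numpy as np

# Hypothetical ratings: rows are users, columns are movies, np.nan marks a missing rating.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 1.0, 1.0],
    [1.0, 1.0, 5.0, np.nan],
    [np.nan, 1.0, 4.0, 5.0],
])
missing = np.isnan(R)

# Step 1: build a complete matrix by filling gaps with each movie's mean rating.
col_means = np.nanmean(R, axis=0)
R_filled = np.where(missing, col_means, R)

# Steps 2 and 3: decompose, then keep only the k strongest latent features.
k = 2
U, s, Vt = np.linalg.svd(R_filled, full_matrices=False)
R_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Step 4: entries of R_hat at the missing positions act as predicted ratings.
for user, movie in zip(*np.where(missing)):
    print(f"user {user}, movie {movie}: predicted {R_hat[user, movie]:.2f}")
```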

### Advantages of an SVD
- Effective at uncovering latent features in the data.
- Reduces dimensionality, making computations more manageable.

### Limitations of an SVD
- Assumes linear relationships in data.
- Sensitive to missing data and outliers.

## The Dataset


#### Dataset being used: **MovieLens 100k dataset**

- This specific dataset, often referred to as "ml-100k," contains 100,000 ratings from 943 users on 1,682 movies. The data was collected through the MovieLens website during the seven-month period from September 19th, 1997 to April 22nd, 1998.

- **Data Structure**: The dataset includes user ratings that range from 1 to 5. Additionally, it provides demographic information about the users (age, gender, occupation, etc.) and details about the movies (titles, genres).

- **Usage**: It's a standard dataset used for implementing and testing recommender systems. Its size is manageable, making it a popular choice for educational purposes and for initial experimentation with recommendation algorithms.

- **Significance**: The diversity in the dataset, both in terms of users and movie genres, provides a rich ground for analyzing different recommendation strategies, testing algorithms like SVD, and understanding user preferences and behavioral patterns.

This dataset is an excellent starting point for anyone looking to delve into the world of recommender systems and practice with real-world data.


Now, we will write some code to understand and explore the dataset.

## The Surprise Library

- In this project we will be using Python's [Surprise library](https://surpriselib.com/), which is built specifically for developing recommendation systems.
- The library includes built-in datasets (like MovieLens 100k), algorithms (like SVD), data-splitting functions, and grid search functions, among other tools for model training.



---



## Milestone 1: Setting up your Environment

**Step 1:** Install a compatible version of [NumPy](https://numpy.org/) below. Simply run the following code to set up the proper environment within your notebook.


In [None]:
#You may have to restart the session after running this cell.
#Run this only once, you can comment out this part of the code after.
!pip install "numpy<2.0"



In [None]:
#Run this only once, you can comment out this part of the code after.
!pip install surprise



In [None]:
#Importing necessary modules for this project
import pandas as pd
from surprise import Dataset
from surprise.model_selection import train_test_split

In [None]:
#Installing pandas and scikit-surprise (the package that provides the `surprise` module). Run this only once, you can comment out this part of the code after.
!pip install pandas scikit-surprise





---



## Milestone 2: Data Importing and Exploration

**Step 1:** Upload your dataset. Fill in the blanks below to import the 'ml-100k' dataset. Import the columns: user, item, rating, and timestamp. Then, print the first 5 rows of your dataset using the `df.head()` method.

In [None]:
data = Dataset.load_builtin('ml-100k')
df = pd.DataFrame(data.raw_ratings, columns=["user", "item", "rating", "timestamp"])

print(df.head())

Dataset ml-100k could not be found. Do you want to download it? [Y/n] Y
Trying to download dataset from https://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to /root/.surprise_data/ml-100k
  user item  rating  timestamp
0  196  242     3.0  881250949
1  186  302     3.0  891717742
2   22  377     1.0  878887116
3  244   51     2.0  880606923
4  166  346     1.0  886397596


We see the following columns:

* **User ID**: A unique identifier for the user who provided the rating.

* **Item ID (Movie ID)**: A unique identifier for the movie that was rated.

* **Rating:** The rating given to the movie by the user. In the MovieLens 100k dataset, these ratings are typically on a scale of 1 to 5.

* **Timestamp:** The time at which the rating was provided. The timestamp is usually in Unix time format, which counts seconds since the Unix epoch (January 1, 1970).
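
To see what a Unix timestamp means in practice, here is a small sketch using only Python's standard library. It converts the first timestamp from the table above into a calendar date and time in UTC.

```python
from datetime import datetime, timezone

# First timestamp from the table above: seconds since 1970-01-01 00:00:00 UTC.
raw_timestamp = 881250949

readable = datetime.fromtimestamp(raw_timestamp, tz=timezone.utc)
print(readable.strftime("%Y-%m-%d %H:%M:%S"))  # 1997-12-04 15:55:49
```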



**Step 2:** Let's explore the dataset! Obtain a summary of the dataset using the `info` function.

In [None]:
#Code below this line:
df.info()

**Step 3:** Next, let's describe the statistics of this dataset using the `describe` function. [Click here if you need a refresher on descriptive statistics](https://youtube.com/playlist?list=PLNs9ZO9jGtUBQfxw7YAmtZJPRiEpnwaNc&feature=shared)!

In [None]:
#Code below this line:
df.describe()

|       | rating   |
|-------|----------|
| count | 100000.0 |
| mean  | 3.52986  |
| std   | 1.125674 |
| min   | 1.0      |
| 25%   | 3.0      |
| 50%   | 4.0      |
| 75%   | 4.0      |
| max   | 5.0      |


**Concept Check:** List any 2 findings you may notice from your data exploration.


1.    Respond here!
2.    Respond here!





---



## Milestone 3: Data Preprocessing

Now, we will do some data preprocessing. This will include checking for missing values and converting timestamps to a readable format.



**Step 1:** Check the dataset for missing values. Learn more about [missing values here](https://www.geeksforgeeks.org/ml-handling-missing-values/).

In [None]:
print(df.isnull().sum())

user         0
item         0
rating       0
timestamp    0
dtype: int64


**Concept Check:** Are there missing values? How do you know?

>*  The function outputs the number of missing (NaN) values per column. Since the value is 0 for every column, the dataset contains no missing values.



**Step 2:** Convert the timestamp to a readable format using the `pd.to_datetime` function. Use seconds (s) as your unit. Need a refresher? [Check out this resource](https://www.geeksforgeeks.org/python/python-pandas-to_datetime/).

In [None]:
# Convert timestamp to a readable format.
# The raw timestamps are stored as strings, so cast them to integers before converting.
df['timestamp'] = pd.to_datetime(df['timestamp'].astype('int64'), unit='s')
print(df.head())


  user item  rating           timestamp
0  196  242     3.0 1997-12-04 15:55:49
1  186  302     3.0 1998-04-04 19:22:22
2   22  377     1.0 1997-11-07 07:18:36
3  244   51     2.0 1997-11-27 05:02:03
4  166  346     1.0 1998-02-02 05:33:16




---



## Milestone 4: Inspecting Training and Test Sets

In machine learning, a dataset is typically split into a **training set** and a **test set** to evaluate a model's performance. The training set is used to teach the model, while the test set is used to assess how well the model makes predictions on unseen data. To learn more, [explore this resource](https://www.w3schools.com/python/python_ml_train_test.asp).
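
The idea behind a train/test split can be sketched in plain Python. This is only an illustration with made-up ratings and a hypothetical helper named `split_ratings`; in this project, Surprise's `train_test_split` handles the shuffling and splitting for you.

```python
import random

def split_ratings(ratings, test_size=0.2, seed=0):
    """Shuffle the ratings, then hold out a fraction of them for testing."""
    shuffled = list(ratings)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical (user, item, rating) tuples: 5 users x 2 items = 10 ratings.
ratings = [(str(u), str(i), float((u + i) % 5 + 1)) for u in range(1, 6) for i in range(1, 3)]

train, test = split_ratings(ratings, test_size=0.2)
print(len(train), "training ratings,", len(test), "test ratings")
```

Every rating ends up in exactly one of the two sets, so the model is never scored on a rating it was trained on.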

**Step 1:** Split the data into a training set and a test set.

In [None]:
# Split the data into a training set and a test set
trainset, testset = train_test_split(data, test_size=0.20)

**Step 2:** Display the number of users and items in the training set.


In [None]:
# Display the number of users and items in the training set
print(f"Number of users: {trainset.n_users}")
print(f"Number of items: {trainset.n_items}")

Number of users: 943
Number of items: 1652


**Step 3:** Display the first few elements of the test set.

In [None]:
# Display the first few elements of the test set
print(testset[:5])

[('650', '290', 2.0), ('482', '311', 4.0), ('872', '1', 3.0), ('234', '989', 2.0), ('836', '260', 2.0)]




---



## Milestone 5: Hyperparameter Tuning and Model Training
Hyperparameter tuning is a critical step in optimizing the performance of an SVD model. The goal is to find the best combination of parameters that results in the most accurate predictions or lowest error rates.

#### Hyperparameters we will be tuning in this project:

1. **`n_factors`**:
   - Represents the number of latent factors (or features) to extract from the dataset.
   - The values can range between **`10 to 200`** to test the model's performance with a varying number of factors. A higher number of factors can capture more complex patterns but may lead to overfitting and increased computation time.

2. **`n_epochs`**:
   - Refers to the number of iterations over the entire dataset during training.
   - The values can range between **`10 to 50`** providing a range to evaluate whether more iterations improve model performance or lead to overtraining.

3. **`lr_all`** (Learning Rate):
   - Determines the step size at each iteration while moving toward a minimum of the loss function.
   - The values can range between **`0.001 to 0.01`** to test how fast the model learns. A smaller learning rate may lead to more precise convergence but requires more epochs.

4. **`reg_all`** (Regularization Term):
   - Helps prevent overfitting by penalizing larger model parameters.
   - The values can range between **`0.01 to 0.1`** offering a range to assess the impact of regularization on model performance. Higher regularization can reduce overfitting but may lead to underfitting.

</br>

**Step 1:** Using the information above, try **defining your own parameter grid** with approximately 2-3 values for each parameter. Test out different values within the range to see how the model changes!

In [None]:
# Here's an example of a grid of SVD hyperparameters for tuning
param_grid_example = {
    'n_factors': [50, 100, 150],
    'n_epochs': [20, 30],
    'lr_all': [0.005, 0.010],
    'reg_all': [0.02, 0.1]
}


In [None]:
# Define your own grid below of SVD hyperparameters for tuning. Use the above information for some helpful hints!

param_grid = {
    'n_factors': [90, 120, 151],
    'n_epochs': [27, 30],
    'lr_all': [0.006, 0.0235],
    'reg_all': [0.05, 0.094]
}
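
A grid search evaluates every combination of the listed values, so the size of the grid drives the run time. The sketch below counts the combinations for a grid shaped like the example above, using `itertools.product` from the standard library. The fold count of 3 is an assumption that matches the cross-validation setting used later in this milestone.

```python
from itertools import product

# Same values as the example grid above: 3 x 2 x 2 x 2 candidates.
grid = {
    "n_factors": [50, 100, 150],
    "n_epochs": [20, 30],
    "lr_all": [0.005, 0.010],
    "reg_all": [0.02, 0.1],
}

# Every combination of values is one candidate model.
combinations = [dict(zip(grid, values)) for values in product(*grid.values())]

n_folds = 3  # assumed number of cross-validation folds
print(f"{len(combinations)} combinations x {n_folds} folds = {len(combinations) * n_folds} model fits")
print(combinations[0])
```

Adding a third value to any one parameter multiplies the total, which is why 2-3 values per parameter is a sensible starting point.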

**Step 2:** Now, practice using Random Search for hyperparameter tuning.


In [None]:
#Define the distribution of SVD hyperparameters for tuning using Random Search. Feel free to tweak the intervals!
from scipy.stats import randint, uniform

param_dist = {
    'n_factors': randint(20, 201), #randint(low, high) picks a random integer from low up to high - 1, here 20 to 200
    'n_epochs': randint(10, 51),
    'lr_all': uniform(0.001, 0.05), #uniform(loc, scale) picks a random number between loc and loc + scale, here 0.001 to 0.051
    'reg_all': uniform(0.0001, 0.1),
}

**Step 3:** Train the model with the following parameters. Review the concepts below to gain a deeper understanding of SVD.

1. **`SVD`**:
   - SVD is used to break down the user-item rating matrix into two smaller matrices - one that represents users and their preferences, and the other that represents items (in our case, movies!) and their characteristics.
   - Under the hood, the algorithm predicts a rating using the following equation: </br></br>
$$
\hat r_{u i} \;=\; \mu \;+\; b_{u} \;+\; b_{i} \;+\; q_{i}^{T} p_{u}
$$

where:  
- $\mu$ is the global mean rating, that is, the average of all ratings in the training data,  
- $b_{u}$ is the user bias,  
- $b_{i}$ is the item (movie) bias,  
- $p_{u}$ and $q_{i}$ are vectors that represent the user's and item's latent features.   </br></br>


The algorithm learns these values by minimizing prediction error on the training data (with a regularization term to avoid overfitting).

$$
\min_{b_u,\,b_i,\,p_u,\,q_i}\quad
\sum_{(u,i)\in R_{\text{train}}} \Bigl(r_{u i} \;-\; \hat r_{u i}\Bigr)^{2}
\;+\; \lambda\,\Bigl(b_{u}^{2} \;+\; b_{i}^{2} \;+\; \lVert p_{u}\rVert^{2} \;+\; \lVert q_{i}\rVert^{2}\Bigr)
$$

Here, $R_{\text{train}}$ is the set of observed $(u,i)$ pairs in the training data, and $\lambda$ is the regularization hyperparameter.


> **`High Level Idea`**
>
> *   Start with a matrix R , the observed data (which might or might not have missing values)
> *   Split the observed matrix into two component matrices (P and Q, these do not have missing values) by optimizing the above loss function
> *   Reconstruct the entries (observed and missing) of R.
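
The high-level idea can be sketched as a tiny training loop in plain Python. This is a simplified illustration with made-up ratings and illustrative hyperparameter values, not Surprise's actual implementation. It uses stochastic gradient descent on the regularized squared error defined above, updating the biases and the latent vectors one observed rating at a time.

```python
import random

# Hypothetical observed ratings: (user, item, rating).
ratings = [
    (0, 0, 5.0), (0, 1, 4.0), (0, 3, 1.0),
    (1, 0, 4.0), (1, 2, 1.0), (1, 3, 1.0),
    (2, 0, 1.0), (2, 1, 1.0), (2, 2, 5.0),
    (3, 1, 1.0), (3, 2, 4.0), (3, 3, 5.0),
]
n_users, n_items = 4, 4
n_factors, n_epochs, lr, reg = 2, 200, 0.01, 0.02  # illustrative values

rng = random.Random(0)
mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
b_u = [0.0] * n_users                              # user biases
b_i = [0.0] * n_items                              # item biases
P = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_users)]
Q = [[rng.gauss(0, 0.1) for _ in range(n_factors)] for _ in range(n_items)]

def predict(u, i):
    """r_hat = mu + b_u + b_i + q_i . p_u"""
    return mu + b_u[u] + b_i[i] + sum(pf * qf for pf, qf in zip(P[u], Q[i]))

def squared_error():
    """Sum of squared prediction errors over the observed ratings."""
    return sum((r - predict(u, i)) ** 2 for u, i, r in ratings)

error_before = squared_error()
for _ in range(n_epochs):
    for u, i, r in ratings:
        err = r - predict(u, i)
        # Gradient step on the regularized squared error for this one rating.
        b_u[u] += lr * (err - reg * b_u[u])
        b_i[i] += lr * (err - reg * b_i[i])
        for f in range(n_factors):
            p_uf, q_if = P[u][f], Q[i][f]
            P[u][f] += lr * (err * q_if - reg * p_uf)
            Q[i][f] += lr * (err * p_uf - reg * q_if)
error_after = squared_error()

print(f"squared error before: {error_before:.3f}, after: {error_after:.3f}")
print(f"prediction for the unobserved pair (0, 2): {predict(0, 2):.2f}")
```

Here `lr` plays the role of `lr_all` and `reg` the role of `reg_all`. Only the observed ratings enter the loop, yet `predict` can be called for any user-item pair afterwards.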


2. **`param_grid`**:
   - It defines the grid of parameters that will be tested. You'll test different combinations of hyperparameters to see which ones work best.
   - Example: If `param_grid` is `{'n_factors': [50, 100], 'lr_all': [0.005, 0.01]}`, GridSearchCV will evaluate the SVD algorithm for all combinations of `n_factors` and `lr_all` from these lists.

3. **`measures=['RMSE', 'MAE']`**:
   - These are the performance metrics used to evaluate the algorithm.
   - `RMSE` stands for [Root Mean Square Error](https://www.sciencedirect.com/topics/engineering/root-mean-square-error), and `MAE` stands for [Mean Absolute Error](https://www.sciencedirect.com/topics/engineering/mean-absolute-error). Both are common metrics for evaluating the accuracy of prediction algorithms, with lower values indicating better performance.

4. **`cv=3`**:
   - This specifies the number of folds for [cross-validation](https://www.geeksforgeeks.org/machine-learning/cross-validation-machine-learning/).
   - In this context, `cv=3` means that a 3-fold cross-validation will be used. The dataset will be split into three parts: in each iteration, two parts will be used for training, and one part will be used for testing. This process repeats three times, each time with a different part used for testing.
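
The two metrics and the fold scheme described above can be sketched in plain Python. The ratings are made up, and the helper names are hypothetical. One simplification to note: this sketch assigns samples to folds in a fixed interleaved order, whereas a library implementation would typically shuffle the data first.

```python
import math

def rmse(true_ratings, predicted_ratings):
    """Root mean square error: large misses are penalized more heavily."""
    squared = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared) / len(squared))

def mae(true_ratings, predicted_ratings):
    """Mean absolute error: the average size of a miss."""
    absolute = [abs(t - p) for t, p in zip(true_ratings, predicted_ratings)]
    return sum(absolute) / len(absolute)

def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, test_indices) so that each sample is tested exactly once."""
    indices = list(range(n_samples))
    for fold in range(n_folds):
        test = indices[fold::n_folds]  # every n_folds-th sample, offset by fold
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test

# Hypothetical true and predicted ratings.
truth = [4.0, 3.0, 2.0, 5.0]
preds = [3.5, 3.0, 3.0, 4.5]
print(f"RMSE: {rmse(truth, preds):.4f}, MAE: {mae(truth, preds):.4f}")

for train, test in k_fold_indices(n_samples=9, n_folds=3):
    print(f"train on {train}, test on {test}")
```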

**Step 4:** Perform Grid Search with cross-validation to find the best hyperparameters for our model. The following cells may take some time to run.

In [None]:
from surprise.model_selection import cross_validate, train_test_split, GridSearchCV
from surprise import SVD, Dataset, Reader, accuracy


# First, we create a GridSearchCV object with the model, parameter grid, metrics and number of folds as the arguments. Hint: review the content above if you're stuck!
gs = GridSearchCV(SVD, param_grid, measures=['RMSE', 'MAE'], cv=3)


#The fit method will train the model for every combination of hyperparameters using 3-fold cross validation.
gs.fit(data)

**Step 5:** Perform Random Search with cross-validation to find the best hyperparameters for our model.

In [None]:
from surprise.model_selection import RandomizedSearchCV

# Create a RandomizedSearchCV object with the same arguments. Two additional arguments are n_iter (number of hyperparameter combinations to sample) and random_state (to make the sampled combinations reproducible). Set the random_state to 42.
rs = RandomizedSearchCV(SVD, param_dist, measures=['RMSE', 'MAE'], cv=3, n_iter=20, random_state=42)


#The fit method will train the model for each of the n_iter sampled combinations of hyperparameters using 3-fold cross validation.
rs.fit(data)

**Step 6:** Print the best RMSE score and parameters.

In [None]:
# Best score and parameters

print(f"Best RMSE: {gs.best_score['rmse']}")
print(f"Best parameters: {gs.best_params['rmse']}")

print(f"Best RMSE using Random Search: {rs.best_score['rmse']}")
print(f"Best parameters using Random Search: {rs.best_params['rmse']}")

Best RMSE: 0.922589019998891
Best parameters: {'n_factors': 151, 'n_epochs': 27, 'lr_all': 0.0235, 'reg_all': 0.094}
Best RMSE using Random Search: 0.9234425894780106
Best parameters using Random Search: {'lr_all': 0.023524962598477153, 'n_epochs': 27, 'n_factors': 151, 'reg_all': 0.09432017556848528}


**Step 7:** Now that you've tested different parameter combinations, use the best-performing model from your grid search. You can access it using the `best_estimator` attribute of your GridSearchCV object (gs) or your RandomizedSearchCV object (rs).

In [None]:
#Use the best model. Use the `best_estimator` attribute on gs/rs
algo = gs.best_estimator['rmse']

In [None]:
#Train and test split. Make sure test_size is 0.25
trainset, testset = train_test_split(data, test_size=0.25)

#Fit the trainset to train the model
algo.fit(trainset)

#Make predictions on the testset
predictions = algo.test(testset)

#Calculate and print RMSE on the predictions made
print(accuracy.rmse(predictions))

RMSE: 0.9112
0.9112179182092204


**Step 8:** To understand the concept of cross-validation further, try changing `cv=3` to `cv=5`. To do this, define a second GridSearchCV and RandomizedSearchCV model (gs2 and rs2) and fit them to the data. Then find the best RMSE and parameters. Finally, respond to the **concept check** below. These cells may take some time to run!



In [None]:
gs2 = GridSearchCV(SVD, param_grid, measures=['RMSE', 'MAE'], cv=5)
gs2.fit(data)

rs2 = RandomizedSearchCV(SVD, param_dist, measures=['RMSE', 'MAE'], cv=5, n_iter=20, random_state=42) #Hint: use the code in Steps 4 & 5 to help if you're stuck! Simply fill in the blanks as we did previously. Remember to change cv to 5!
rs2.fit(data)

In [None]:
print(f"Best RMSE: {gs2.best_score['rmse']}")
print(f"Best parameters: {gs2.best_params['rmse']}")

print(f"Best RMSE: {rs2.best_score['rmse']}")
print(f"Best parameters: {rs2.best_params['rmse']}") #Hint: use the code in Step 6 to help if you're stuck.

Best RMSE: 0.913225665909964
Best parameters: {'n_factors': 151, 'n_epochs': 27, 'lr_all': 0.0235, 'reg_all': 0.094}
Best RMSE: 0.9133596445782564
Best parameters: {'lr_all': 0.023524962598477153, 'n_epochs': 27, 'n_factors': 151, 'reg_all': 0.09432017556848528}


In [None]:
#Define the best estimator, split the dataset and fit the training set to the model. Then, make predictions on the test set.
#Use rs2, the random search fitted with cv=5 in this step.
algo2 = rs2.best_estimator['rmse']

In [None]:
trainset, testset = train_test_split(data, test_size=0.25)
algo2.fit(trainset)
predictions2 = algo2.test(testset)
#Evaluate the predictions made by algo2 (predictions2), not the earlier `predictions`
accuracy.rmse(predictions2)

RMSE: 0.9158


0.9157836552935479

**Concept Check:** Report back on how the RMSE changes when changing `cv=3` to `cv=5`. Does it increase/decrease/stay the same?

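>*  Increasing the number of cross-validation folds from 3 to 5 lowered the best RMSE for both searches: from about 0.9226 to 0.9132 with Grid Search and from about 0.9234 to 0.9134 with Random Search. A lower RMSE means more accurate predictions. A likely reason is that with 5 folds each model is trained on 80% of the data in every iteration, compared with about 67% with 3 folds, so the model sees more ratings before it is evaluated.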

**Step 9:** Now we will use the first model to predict the rating for a specific user and item.

In [None]:
#Predict rating for a user and item
user_id = '196'  # replace with a specific user ID
item_id = '302'  # replace with a specific item (movie) ID
predicted_rating = algo.predict(user_id, item_id)
print(f"Predicted rating for user {user_id} and item {item_id}: {predicted_rating.est}")

Predicted rating for user 196 and item 302: 3.895047049050553


In [None]:
# To inspect the predictions in detail, let's print the first 10 predictions made by the model
for idx, prediction in enumerate(predictions[:10]):
    print(f'Prediction {idx}: User {prediction.uid} and item {prediction.iid} has true rating {prediction.r_ui}, and the predicted rating is {prediction.est}')


Prediction 0: User 286 and item 1074 has true rating 4.0, and the predicted rating is 3.502103837845117
Prediction 1: User 276 and item 452 has true rating 3.0, and the predicted rating is 2.8437389791921817
Prediction 2: User 733 and item 125 has true rating 2.0, and the predicted rating is 2.914217992389738
Prediction 3: User 13 and item 59 has true rating 4.0, and the predicted rating is 4.3191918877872775
Prediction 4: User 474 and item 737 has true rating 4.0, and the predicted rating is 3.7387711568438435
Prediction 5: User 246 and item 1044 has true rating 1.0, and the predicted rating is 2.838956047495981
Prediction 6: User 903 and item 185 has true rating 5.0, and the predicted rating is 4.544981570680432
Prediction 7: User 95 and item 779 has true rating 3.0, and the predicted rating is 2.3908631107637754
Prediction 8: User 405 and item 1590 has true rating 1.0, and the predicted rating is 2.301079448226717
Prediction 9: User 457 and item 425 has true rating 4.0, and the pred



---



## Milestone 6: Rounding Numbers
Rounding values is a technique used to simplify numbers, but its appropriateness depends on the context.

**When to Round**
1. **Simplification**: For estimations.
2. **Reporting**: When exact figures aren't necessary (e.g., in everyday language).
3. **Data Analysis**: To focus on significant trends by ignoring minor variations.
4. **Financial Transactions**: Rounding to the smallest currency unit.
5. **Display Purposes**: For clarity in graphs or tables.

**When NOT to Round**
1. **Intermediate Calculations**: Early rounding can lead to significant final errors.
2. **Legal/Regulatory Documents**: Require exact figures.
3. **Scientific/Engineering Work**: Precision is crucial.
4. **Critical Calculations**: In health, safety, or finance, precision is essential.

To summarize,
- Rounding depends on the purpose and context of the calculation.
- It is useful for simplification and clarity but should be avoided when precision is critical.
- We must be aware of potential cumulative errors in sequential calculations.
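
Here is a small sketch of these points, using the first nine predicted ratings printed at the end of Milestone 5. It shows that rounding first and averaging later gives a different result than averaging the raw values, and that Python's built-in `round()` sends exact halves to the nearest even integer.

```python
# Predicted ratings copied from the first nine predictions printed in Milestone 5.
estimates = [
    3.502103837845117, 2.8437389791921817, 2.914217992389738,
    4.3191918877872775, 3.7387711568438435, 2.838956047495981,
    4.544981570680432, 2.3908631107637754, 2.301079448226717,
]

rounded = [round(e) for e in estimates]
print(rounded)

# Rounding early changes later calculations, such as the mean.
mean_of_raw = sum(estimates) / len(estimates)
mean_of_rounded = sum(rounded) / len(rounded)
print(f"mean of raw values: {mean_of_raw:.3f}, mean of rounded values: {mean_of_rounded:.3f}")

# Exact halves go to the nearest even integer.
print(round(2.5), round(3.5))
```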


Let us round the values of the predictions so that they fall within the rating categories of [1.0, 2.0, 3.0, 4.0, 5.0].

**Step 1:** Import the math library, which provides `math.ceil` and `math.floor` for rounding up or down.

In [None]:
import math

**Step 2:** Round the prediction.est variable.

In [None]:
#Round the prediction.est variable being printed using Python's built-in round() function

for idx, prediction in enumerate(predictions[:10]):
    print(f'Prediction {idx}: User {prediction.uid} and item {prediction.iid} has true rating {prediction.r_ui}, and the predicted rating is {round(prediction.est)}')


Prediction 0: User 286 and item 1074 has true rating 4.0, and the predicted rating is 4
Prediction 1: User 276 and item 452 has true rating 3.0, and the predicted rating is 3
Prediction 2: User 733 and item 125 has true rating 2.0, and the predicted rating is 3
Prediction 3: User 13 and item 59 has true rating 4.0, and the predicted rating is 4
Prediction 4: User 474 and item 737 has true rating 4.0, and the predicted rating is 4
Prediction 5: User 246 and item 1044 has true rating 1.0, and the predicted rating is 3
Prediction 6: User 903 and item 185 has true rating 5.0, and the predicted rating is 5
Prediction 7: User 95 and item 779 has true rating 3.0, and the predicted rating is 2
Prediction 8: User 405 and item 1590 has true rating 1.0, and the predicted rating is 2
Prediction 9: User 457 and item 425 has true rating 4.0, and the predicted rating is 4




---



<h3 align = 'center' >
Thank you for completing the project!
</h3>

### **[Please submit all materials to the NSDC HQ team via this Google Form](https://docs.google.com/forms/d/e/1FAIpQLSfHQ2TYokDwXcR2X9s3_syFdZVi_gK7bvgFvjalxwTFzkBoLQ/viewform).**

You may submit your project at any time. The June 27th deadline is a recommendation only. The Project Team will send out certificates to participants no sooner than Thursday, July 3rd. Please be patient with our small team as we process everyone's submissions.

Participants who submit completed projects will receive a virtual certificate of completion. Do reach out to us if you have any questions or concerns at nsdc@columbia.edu. We are here to help you learn and grow.
