## This homework assesses your ability to build and deploy your applications.


`start date: Nov 22nd 11:59 PM` <br>
`due date: Dec 5th 11:59 PM`

### Make sure you submit your work to Brightspace.

`Total credits:  63/53`

You're welcome to share your thoughts about the homework and the course materials here: https://forms.gle/Kd9AoUZwkMiF8Vx5A

# P1: File path related questions

In the following structure, `project`, `data`, and `scripts` are folder names. `my_notebook.ipynb` is the Jupyter Notebook you used to create these folders.

```text
project/
    data/
        file_name.csv
    scripts/
        script.py
my_notebook.ipynb  

```

`P1.1   3 pts` Suppose I'm currently at the same level as the `project` folder, meaning I am outside the `project` folder but within the same parent directory. I want to open the `file_name.csv` file using the following code written in `my_notebook.ipynb`:

```python
df = pd.read_csv('file_name.csv')

```
Will this code work? If not, how should I revise it? 

No, it won't work. The code looks for `file_name.csv` in the current working directory (the parent directory that contains `project`), and because the file is not there, pandas raises a `FileNotFoundError`. To fix this, we can use:
```python
df = pd.read_csv('./project/data/file_name.csv')
```

This way, I'm providing the full relative path from the current working directory to the file.
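
Before calling `pd.read_csv`, it can help to confirm what the working directory is and whether a candidate path exists. The following is a minimal sketch that assumes the folder layout above; `find_csv` is a hypothetical helper name, not part of the assignment.

```python
from pathlib import Path


def find_csv(base_dir, relative_path):
    """Return the resolved path if the file exists under base_dir, else None."""
    candidate = Path(base_dir) / relative_path
    return candidate.resolve() if candidate.is_file() else None


# Relative paths are resolved against the current working directory.
print("Working directory:", Path.cwd())
print("Bare name found:", find_csv(Path.cwd(), "file_name.csv"))
print("Full relative path found:", find_csv(Path.cwd(), "project/data/file_name.csv"))
```

If the second lookup prints `None` as well, the notebook is probably not running from the directory that contains `project`.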

`p1.2 3 pts`  Suppose you are now in the `data` folder, and you want to re-write your `script.py` file using the `%%writefile` method while reflecting all your actions (e.g., cd commands, file path changes) within the `my_notebook.ipynb` file. Can you demonstrate how you would accomplish this?

We can use 
```python
%%writefile ../scripts/script.py 
```
to accomplish this. It works because the two dots (`..`) mean we go up one level (to the `project` folder). From there, `scripts/` takes us into the `scripts` folder, and `script.py` is the file that gets created or overwritten. The new contents of the file go on the lines below the `%%writefile` line in the same cell.
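
To check where a path containing `..` ends up, the segments can be collapsed in plain Python. This is only an illustrative sketch; `resolve_relative` is a hypothetical helper, and it works on path strings without touching the file system.

```python
import posixpath


def resolve_relative(current_dir, relative_path):
    """Join a relative path onto current_dir and collapse any '..' segments."""
    return posixpath.normpath(posixpath.join(current_dir, relative_path))


# From inside project/data, '..' climbs to project, then scripts/script.py descends.
print(resolve_relative("project/data", "../scripts/script.py"))
# → project/scripts/script.py
```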

`p1.3 3 pts`  Suppose you are outside of the `project` folder but still within the same parent directory, and you want to re-write your `script.py` file using the `%%writefile` method while reflecting all your actions (e.g., cd commands, file path changes) within the `my_notebook.ipynb` file. Can you demonstrate how you would accomplish this? Assume you did everything within your `my_notebook.ipynb`. 

We can use 
```python
%%writefile project/scripts/script.py
```
This works because the current directory is the one that contains `my_notebook.ipynb`, so the relative path `project/scripts/script.py` leads from our current location through the `project` folder to the target file.

`p1.4 3 pts` Suppose you are outside of the project folder but still within the same parent directory, and you want to create a sub-folder named `test_data` under the `data` folder. Can you demonstrate how you would accomplish this? Assume you did everything within your `my_notebook.ipynb`. 

We can use 
```python
!mkdir project/data/test_data
```
This command works because the current directory is the one that contains `my_notebook.ipynb`, so the relative path `project/data/test_data` specifies exactly where to create the new folder. It creates the `test_data` subfolder inside the `data` directory while keeping all actions within the notebook.
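
A pure-Python alternative, sketched here under the same layout assumption, avoids shell differences between operating systems and does not fail if the folder already exists. `make_subfolder` is a hypothetical helper name.

```python
import os


def make_subfolder(parent, name):
    """Create parent/name (and any missing parents) and return its path."""
    target = os.path.join(parent, name)
    os.makedirs(target, exist_ok=True)  # no error if the folder already exists
    return target
```

From the notebook, `make_subfolder(os.path.join('project', 'data'), 'test_data')` would create the same folder as the `!mkdir` command.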

`p1.5 5 pts` Suppose you are outside of the project folder but still within the same parent directory, and you want to import your script.py file as a module. Your `script.py` file has the following contents:

```python
df = pd.read_csv('file_name.csv')
```

You have used the following code to do the import within your `my_notebook.ipynb` file:

```python
import script
```

What are the issues with the script.py file itself and also with the way it is imported? Explain below and fix it yourself.

There are two main issues:

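1. Issues with the script.py file:


- pandas (`pd`) is not imported in the script, so the import fails with a `NameError`
- The path to file_name.csv is incorrect
- A relative path is resolved against the current working directory (here, the folder that contains `my_notebook.ipynb`), not against the folder that holds script.py, so the script needs the full relative path 'project/data/file_name.csv'


2. Issue with the import statement:


- `script.py` lives in `project/scripts`, which is not on Python's module search path, so `import script` raises a `ModuleNotFoundError`
- We need to either import the module through its package path (as done here) or add the scripts directory to `sys.path`

Here's the corrected code:
- For script.py:
```python
import pandas as pd

# Relative to the notebook's working directory (the parent of project/)
df = pd.read_csv('project/data/file_name.csv')
```
- For my_notebook.ipynb:
```python
from project.scripts import script
```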
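
Another way to handle both problems is to anchor the CSV path to the location of `script.py` itself through `__file__`, and to put `project/scripts` on `sys.path` before importing. The sketch below rebuilds the layout in a temporary directory so it can run anywhere; `SCRIPT_SOURCE`, `build_demo_project`, `import_script`, and `DATA_PATH` are hypothetical names chosen for illustration, and the demo script only computes the path instead of reading it with pandas.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Hypothetical script.py body: the CSV path is built from the script's own
# location, so it no longer depends on the caller's working directory.
SCRIPT_SOURCE = '''
from pathlib import Path

DATA_PATH = Path(__file__).resolve().parent.parent / "data" / "file_name.csv"
'''


def build_demo_project(root):
    """Recreate the project/data and project/scripts layout under root."""
    data_dir = Path(root) / "project" / "data"
    scripts_dir = Path(root) / "project" / "scripts"
    data_dir.mkdir(parents=True)
    scripts_dir.mkdir(parents=True)
    (data_dir / "file_name.csv").write_text("a,b\n1,2\n")
    (scripts_dir / "script.py").write_text(SCRIPT_SOURCE)
    return scripts_dir


def import_script(scripts_dir):
    """Put scripts_dir on sys.path so that a plain `import script` succeeds."""
    sys.path.insert(0, str(scripts_dir))
    try:
        sys.modules.pop("script", None)  # avoid a stale cached module
        importlib.invalidate_caches()
        return importlib.import_module("script")
    finally:
        sys.path.remove(str(scripts_dir))


with tempfile.TemporaryDirectory() as root:
    script = import_script(build_demo_project(root))
    print(script.DATA_PATH.is_file())  # prints True
```

The trade-off is that the `__file__` approach keeps working when the notebook is started from a different directory, whereas a path such as `'project/data/file_name.csv'` depends on where the notebook runs.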

`p1.6 6 pts` Suppose you are outside of the project folder but still within the same parent directory, and you want to first import the class `clean_data`  from `script.py`, and then create an instance of the class. Your `script.py` file has the following contents:

```python
class clean_data:

    def __init__(self):
        df = pd.read_csv(`file_name.csv`)
```

You have used the following code to import the class and create an instance:

```python
import script

clean_data = clean_data()

```

What are the issues with the script.py file itself (1 issue), with the way it is imported (2 issues), and with the way the instance is created (1 issue)? Explain below and fix it yourself.

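1. In script.py:


- Wrong quotation marks around file_name.csv (uses backticks `` instead of quotes ''), which is a syntax error in Python
- Missing pandas import
- The bare file name is resolved against the notebook's working directory, so the path must be 'project/data/file_name.csv'
- The DataFrame is stored in a local variable and lost when `__init__` returns, so it should be assigned to `self.df`


2. In import statement:


- Doesn't specify the correct path to the module, so `import script` raises a `ModuleNotFoundError`
- Imports the entire module instead of the class, so the bare name `clean_data` is undefined in the notebook


3. In instance creation:


- Creates a naming conflict by using the same name for the instance as for the class

Here's the corrected code:

- For script.py:
```python
import pandas as pd

class clean_data:
    def __init__(self):
        # Relative to the notebook's working directory (the parent of project/)
        self.df = pd.read_csv('project/data/file_name.csv')
```

- For my_notebook.ipynb:
```python
from project.scripts.script import clean_data

data_cleaner = clean_data()
```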
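
The naming conflict is easy to reproduce in isolation. The sketch below uses a stand-in class with no file access (an assumption made so it runs without the CSV) to show what happens once the class name is rebound to an instance.

```python
class clean_data:
    """Stand-in for the class in script.py (no file access in this sketch)."""

    def __init__(self):
        self.df = None


clean_data = clean_data()  # the name now points at the instance, not the class

try:
    clean_data()  # fails: instances of this class are not callable
except TypeError as err:
    print("TypeError:", err)
```

After the rebinding, no second instance can be created in that notebook session without importing the class again, which is why a separate name such as `data_cleaner` is safer.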

# P2. `30 pts` Coding challenges (Individual version)

For this question, you will build a movie recommendation system using K-Nearest Neighbors (KNN) and create a webpage interface with Streamlit. The Streamlit webpage should allow the user to input a movie name they have watched before, and based on that input, the system will recommend 4 similar movies. Additionally, you will deploy your Streamlit app on AWS EC2.

For this question, I will place fewer restrictions on the choice of data and features, allowing you to mimic real-world decision-making as data professionals. In previous assignments, I guided you step-by-step to teach you how to correctly code each small part. Now that you’ve gained those foundational skills, this assignment will focus more on exercising your ability to design and solve problems independently.

You are free to use any dataset you like for this assignment. The movie dataset from HW8 is sufficient, but you are welcome to explore and use other datasets if you prefer.

Your final submission should include:

- A screenshot of the Streamlit webpage displaying your app and its recommendations.
- The source code used to create the application.

`Bonus 10 pts` Use a pre-trained LLM to add a short description of the movie provided by the user.


In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('movies.csv')
df.head()

| | Unnamed: 0 | movieId | average_rating | title | genres | year |
|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 3.893708 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 1995 |
| 1 | 1 | 2 | 3.251527 | Jumanji (1995) | Adventure\|Children\|Fantasy | 1995 |
| 2 | 2 | 3 | 3.142028 | Grumpier Old Men (1995) | Comedy\|Romance | 1995 |
| 3 | 3 | 4 | 2.853547 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance | 1995 |
| 4 | 4 | 5 | 3.058434 | Father of the Bride Part II (1995) | Comedy | 1995 |


In [3]:
df.drop(columns='Unnamed: 0',inplace=True)
df.head()

| | movieId | average_rating | title | genres | year |
|---|---|---|---|---|---|
| 0 | 1 | 3.893708 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy | 1995 |
| 1 | 2 | 3.251527 | Jumanji (1995) | Adventure\|Children\|Fantasy | 1995 |
| 2 | 3 | 3.142028 | Grumpier Old Men (1995) | Comedy\|Romance | 1995 |
| 3 | 4 | 2.853547 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance | 1995 |
| 4 | 5 | 3.058434 | Father of the Bride Part II (1995) | Comedy | 1995 |


In [4]:
# Convert the genres string to a list
df['genres'] = df['genres'].str.split('|')


In [5]:
df

| | movieId | average_rating | title | genres | year |
|---|---|---|---|---|---|
| 0 | 1 | 3.893708 | Toy Story (1995) | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 |
| 1 | 2 | 3.251527 | Jumanji (1995) | [Adventure, Children, Fantasy] | 1995 |
| 2 | 3 | 3.142028 | Grumpier Old Men (1995) | [Comedy, Romance] | 1995 |
| 3 | 4 | 2.853547 | Waiting to Exhale (1995) | [Comedy, Drama, Romance] | 1995 |
| 4 | 5 | 3.058434 | Father of the Bride Part II (1995) | [Comedy] | 1995 |
| ... | ... | ... | ... | ... | ... |
| 59042 | 209157 | 1.500000 | We (2018) | [Drama] | 2018 |
| 59043 | 209159 | 3.000000 | Window of the Soul (2001) | [Documentary] | 2001 |
| 59044 | 209163 | 4.500000 | Bad Poems (2018) | [Comedy, Drama] | 2018 |
| 59045 | 209169 | 3.000000 | A Girl Thing (2001) | [(no genres listed)] | 2001 |


In [6]:
%%writefile app.py

import numpy as np
import pandas as pd
import streamlit as st

class MovieKNN:
    def __init__(self, k=4):
        self.k = k
    
    def prepare_features(self, df):
        # movies.csv stores genres as a pipe-separated string (e.g. "Comedy|Romance"),
        # so split on '|' to get one list entry per genre
        if isinstance(df['genres'].iloc[0], str):
            df['genres'] = df['genres'].str.split('|')

        # Create feature matrix
        # 1. Year as numeric feature
        years = df['year'].values.reshape(-1, 1)

        # 2. Ratings as numeric feature
        ratings = df['average_rating'].values.reshape(-1, 1)

        # 3. Genre similarity (one-hot encoding)
        all_genres = set()
        for genres in df['genres']:
            all_genres.update(genres)

        genre_matrix = np.zeros((len(df), len(all_genres)))
        for i, genres in enumerate(df['genres']):
            for j, genre in enumerate(all_genres):
                if genre in genres:
                    genre_matrix[i, j] = 1

        # Combine features
        self.features = np.hstack([
            years / years.max(), 
            ratings / ratings.max(),  
            genre_matrix  
        ])

        self.movies = df
        return self
    
    def euclidean_distance(self, movie1, movie2):
        return np.sqrt(np.sum((movie1 - movie2) ** 2))
    
    def get_recommendations(self, movie_title):
        # Find the movie index
        movie_idx = self.movies[self.movies['title'] == movie_title].index[0]
        movie_features = self.features[movie_idx]
        
        # Calculate distances to all other movies
        distances = []
        for idx, features in enumerate(self.features):
            if idx != movie_idx:  # Skip the input movie
                dist = self.euclidean_distance(movie_features, features)
                distances.append((idx, dist))
        
        # Sort by distance and get top k
        distances.sort(key=lambda x: x[1])
        neighbors = distances[:self.k]
        
        # Get the recommended movie titles
        recommendations = [self.movies.iloc[idx]['title'] for idx, _ in neighbors]
        
        return recommendations
    


# Load and prepare data
df = pd.read_csv('movies.csv')
knn = MovieKNN(k=4)
knn.prepare_features(df)

st.title('Movie Recommendation System')

# Create text input for movie title
selected_movie = st.text_input('Enter a movie title you like:')

if st.button('Get Recommendations'):
    # Check if movie exists in database
    if selected_movie in df['title'].values:
        recommendations = knn.get_recommendations(selected_movie)
        
        st.write("### Based on your selection, we recommend:")
        for i, movie in enumerate(recommendations, 1):
            st.write(f"{i}. {movie}")
    else:
        st.error("Sorry, this movie is not in our database. Please try another movie.")



Overwriting app.py
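
The neighbor search in `get_recommendations` loops over every movie in Python, which can be slow for a dataset of this size. One possible alternative, shown here only as a sketch, computes all distances at once with NumPy broadcasting. `nearest_neighbors` and the toy matrix are hypothetical and stand in for `self.features`.

```python
import numpy as np


def nearest_neighbors(features, query_idx, k=4):
    """Return the row indices of the k rows closest to features[query_idx]."""
    # Broadcasting subtracts the query row from every row in one step.
    distances = np.linalg.norm(features - features[query_idx], axis=1)
    distances[query_idx] = np.inf  # never recommend the query movie itself
    return np.argsort(distances)[:k].tolist()


# Toy feature matrix: [scaled year, scaled rating, genre flags...]
toy = np.array([
    [1.0, 0.9, 1, 0],
    [1.0, 0.8, 1, 0],
    [0.9, 0.5, 0, 1],
    [1.0, 0.9, 1, 1],
])
print(nearest_neighbors(toy, query_idx=0, k=2))  # prints [1, 3]
```

The returned indices are positional, so they would be used with `self.movies.iloc[...]` in the same way as the original loop.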


# Bonus

In [9]:
%%writefile app2.py

import numpy as np
import pandas as pd
import streamlit as st
from transformers import pipeline

class MovieKNN:
    def __init__(self, k=4):
        self.k = k
    
    def prepare_features(self, df):
        # movies.csv stores genres as a pipe-separated string (e.g. "Comedy|Romance"),
        # so split on '|' to get one list entry per genre
        if isinstance(df['genres'].iloc[0], str):
            df['genres'] = df['genres'].str.split('|')

        # Create feature matrix
        # 1. Year as numeric feature
        years = df['year'].values.reshape(-1, 1)

        # 2. Ratings as numeric feature
        ratings = df['average_rating'].values.reshape(-1, 1)

        # 3. Genre similarity (one-hot encoding)
        all_genres = set()
        for genres in df['genres']:
            all_genres.update(genres)

        genre_matrix = np.zeros((len(df), len(all_genres)))
        for i, genres in enumerate(df['genres']):
            for j, genre in enumerate(all_genres):
                if genre in genres:
                    genre_matrix[i, j] = 1

        # Combine features
        self.features = np.hstack([
            years / years.max(), 
            ratings / ratings.max(),  
            genre_matrix  
        ])

        self.movies = df
        return self
    
    def euclidean_distance(self, movie1, movie2):
        return np.sqrt(np.sum((movie1 - movie2) ** 2))
    
    def get_recommendations(self, movie_title):
        # Find the movie index
        movie_idx = self.movies[self.movies['title'] == movie_title].index[0]
        movie_features = self.features[movie_idx]
        
        # Calculate distances to all other movies
        distances = []
        for idx, features in enumerate(self.features):
            if idx != movie_idx:  # Skip the input movie
                dist = self.euclidean_distance(movie_features, features)
                distances.append((idx, dist))
        
        # Sort by distance and get top k
        distances.sort(key=lambda x: x[1])
        neighbors = distances[:self.k]
        
        # Get the recommended movie titles
        recommendations = [self.movies.iloc[idx]['title'] for idx, _ in neighbors]
        
        return recommendations
    


# Initialize the text generation model
generator = pipeline('text-generation', model='gpt2')

# Load and prepare data
df = pd.read_csv('movies.csv')
knn = MovieKNN(k=4)
knn.prepare_features(df)

st.title('Movie Recommendation System')

# Create text input for movie title
selected_movie = st.text_input('Enter a movie title you like:')

if st.button('Get Recommendations'):
    # Check if movie exists in database
    if selected_movie in df['title'].values:
        # Generate movie description
        prompt = f"A brief description of the movie {selected_movie}: "
        description = generator(prompt, max_length=100, num_return_sequences=1)[0]['generated_text']
        
        # Display movie description
        st.write("### Movie Description:")
        st.write(description)
        
        # Get and display recommendations
        recommendations = knn.get_recommendations(selected_movie)
        st.write("### Based on your selection, we recommend:")
        for i, movie in enumerate(recommendations, 1):
            st.write(f"{i}. {movie}")
    else:
        st.error("Sorry, this movie is not in our database. Please try another movie.")


Overwriting app2.py
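
Both versions of the app accept a title only when it matches the `title` column exactly, including the year in parentheses, so an input such as `toy story` is rejected. A possible refinement, sketched here with the standard library's `difflib`, is to suggest close matches instead. `suggest_titles` is a hypothetical helper, and the cutoff of 0.6 is an arbitrary assumption rather than a tuned value.

```python
import difflib


def suggest_titles(user_input, titles, limit=3):
    """Return up to `limit` known titles that resemble user_input."""
    # Compare in lowercase, then map back to the original spelling.
    lowered = {title.lower(): title for title in titles}
    matches = difflib.get_close_matches(
        user_input.lower(), list(lowered), n=limit, cutoff=0.6
    )
    return [lowered[match] for match in matches]


titles = ["Toy Story (1995)", "Jumanji (1995)", "Grumpier Old Men (1995)"]
print(suggest_titles("toy story", titles))
```

In the Streamlit app, the suggestions could be shown in the `else` branch next to the error message so the user can retry with an exact title.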


# P2. `30 pts` Coding challenges (Group version)

For this assignment, you are allowed to collaborate with up to 2 classmates and turn it into a group project. You can choose any problem and dataset to work on, but your project must demonstrate 3 of the following skills. Note: You must include either 3 (Running the project in Docker) or 4 (Deploying the application on AWS EC2) as one of your chosen skills.

1. Creating a Python package.
2. Building a webpage using Streamlit with user input functionality.
3. Running the project in Docker (mandatory if 4 is not chosen).
4. Deploying the application on AWS EC2 (mandatory if 3 is not chosen).
5. Applying KNN to solve a problem.
6. Utilizing LLM models in your application.


In your submission, you must clearly list your teammates' names, and each team member must submit the assignment individually on Brightspace.

Your final submission should include:

- A screenshot of the Streamlit webpage displaying your app and its functionality.
- The source code for the application.

`Bonus 10 pts` 


Prepare a roughly 5 minute presentation to deliver in class. If you plan to present, please notify me at least 2 days before the deadline to allow sufficient time for adjustments to the course schedule. Bonus points will be awarded after the presentation.