Before submitting, rename this file to your first and last name.

# 📝 Python Final Exam (40 Points)

## **General Guidelines:**  
- This is a closed-book exam. **Internet searches are NOT allowed!**  
- You must complete **all questions** in this Jupyter Notebook.  
- **Plagiarism is strictly prohibited.** Any identical or highly similar submissions will be flagged.  
- You are only allowed to use functions and concepts that were covered in the lectures. Using any functions or libraries beyond what was taught will result in a deduction of points.  
- The total time for this exam is **3 hours**. Manage your time wisely.  
- Your code should be clear, structured, and follow good programming practices.  

## **Submission Instructions:**  
- **Explain your thought process before coding** for each question. Include:  
  - Your step-by-step approach to solving the problem.  
  - Any assumptions or constraints.  
- <span style="color:red">**No explanation = 0 score.**</span>  

**Example Answer Format:**  
Question: Write a function that sums all even numbers in a list.  

Explanation (Before Coding):  
- Define a function to take a list as input.  
- Iterate through the list, checking if each number is even (divisible by 2).  
- Add even numbers to a running sum and return the total.  
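
Code (After Explanation):  
A minimal sketch of the code that could follow the explanation above. The function name `sum_even_numbers` and the sample list are illustrative choices, not requirements.

```python
def sum_even_numbers(numbers):
    """Return the sum of all even numbers in the given list."""
    total = 0
    for number in numbers:
        # A number is even when it is divisible by 2.
        if number % 2 == 0:
            total += number
    return total


print(sum_even_numbers([1, 2, 3, 4, 5, 6]))  # prints 12
```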

### Melody Archive

You are a music curator at the "Melody Archive," and your task is to organize and analyze data about the songs in the collection. You will use Python to solve the following challenges, such as categorizing songs by genre, calculating average song lengths, and identifying the most popular artists. <br>
<span style="color:red">Print the result of each task.</span>  

## Problem 1: Data Manipulation and Analysis Using Pandas (13 points)

This problem evaluates students' proficiency in manipulating and analyzing data using the Pandas library. It encompasses a comprehensive range of essential DataFrame operations.

1. **Create a Pandas Series**  
   Generate a Pandas Series containing your personal favorite song titles.

2. **Create a DataFrame with Artist Information**  
   Using the Series created in Task 1, construct a DataFrame with an additional column named `artist` that includes the corresponding artists for each song.

3. **Create a DataFrame from a List of Dictionaries**  
   Construct a DataFrame from a list of dictionaries, where each dictionary contains the following keys: `title`, `artist`, `genre`, `year`, `duration`, and `popularity`. Populate this DataFrame with your personal favorite song titles.

4. **Add a New Column**  
   Introduce a new column of your choice to the DataFrame created in Task 3. Ensure the column is relevant to the songs in the DataFrame.

5. **Retrieve and Print Specific Columns**  
   From the DataFrame created in Task 3, extract and print the first 10 rows, displaying only the `title` and `artist` columns.

6. **Filter and Display Specific Rows**  
   Filter the DataFrame to include only rows where the `genre` is "Pop". Print the resulting rows, displaying only the `title` and `year` columns.

7. **Retrieve Unique Genre Values**  
   From your DataFrame, extract and print the unique values present in the `genre` column.

8. **Filter Songs by Release Year**  
   Filter your DataFrame to include only songs released after the year 2000. Print the resulting DataFrame.

9. **Display DataFrame Information**  
   Use DataFrame functions to display the following information:  
   - Dimensions of the DataFrame  
   - Summary statistics for numerical columns  
   - Mean duration of all songs

10. **Update the DataFrame**  
    Copy your DataFrame using `df.copy()` and perform the following operations on the new DataFrame:  
    - Sort the DataFrame by `popularity` in descending order.  
    - Remove the `year` column.  
    - Rename the `duration` column to `length_seconds`.

11. **Group by Genre and Calculate Average Popularity**  
    Group the DataFrame by `genre` and calculate the average `popularity` for each genre. Print the result.

12. **Concatenate DataFrames**  
    Concatenate your DataFrame with the provided `additional_songs` DataFrame into a new DataFrame named `full_library`. Print the resulting DataFrame.

13. **Merge DataFrames**  
    Create another DataFrame containing your song titles and their corresponding ratings. Merge this DataFrame with `full_library` on the `title` column. Print the resulting DataFrame.

In [None]:
import pandas as pd 

additional_songs = pd.DataFrame({
    "title": ["Hotel California", "Uptown Funk"],
    "artist": ["Eagles", "Mark Ronson ft. Bruno Mars"] ,
    "genre": ["Rock", "Pop"] ,
    "year": [1976, 2014] ,
    "duration": [391, 270] ,
    "popularity": [94, 97]
})

In [None]:
# Write your code here

## Problem 2: Data Visualization (7 points)

This problem evaluates students' ability to create and customize visualizations using Matplotlib and Seaborn. The tasks focus on generating various types of plots to analyze and present data effectively.

1. **Line Chart: Popularity of Songs per Genre**  
   Create a line chart to display the average popularity of songs in each genre. You can use the result of Task 11 from Problem 1. Ensure the plot includes a title, x-label, and y-label.
   
2. **Bar Chart: Number of Songs per Genre**  
   Create a bar chart using Matplotlib to visualize the number of songs in each genre. Use the `genre` column for the x-axis and the count of songs for the y-axis. Add a title, x-label, and y-label to the plot.

3. **Histogram: Distribution of Song Durations**  
   Create a histogram to visualize the distribution of song durations. Use the `duration` column and set the number of bins. Include a title, x-label, and y-label in the plot.

4. **Scatter Plot: Song Duration vs. Popularity**  
   Create a scatter plot to explore the relationship between song duration and popularity. Add a title, x-label, and y-label to the plot.

5. **Pie Chart: Percentage of Songs per Genre**  
   Create a pie chart to show the percentage of songs in each genre. Use the `genre` column to group the data. Include a title and labels for each genre in the plot.

6. **Heatmap: Popularity of Songs by Artist**  
   Using Seaborn, create a heatmap to visualize the popularity of songs by each artist. Add a title to the plot. You can use the following line to prepare data for plotting: <br>
   `df.pivot(index='title', columns='artist', values='popularity')`

7. **Customize a Plot**  
   Choose any plot from Tasks 1–6 and customize it by:  
   - Changing the color scheme.  
   - Adding a grid.  
   - Adjusting the figure size.  
   - Adding annotations (e.g., highlighting the most popular song).
   - Changing any other parameters of your choice.
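
The `df.pivot(...)` line given in Task 6 reshapes a long table into a wide one. The sketch below shows the shape of its result on a small toy DataFrame. The song titles, artist names, and the variable names `songs` and `heatmap_data` are invented for illustration and are not part of the exam data.

```python
import pandas as pd

# Toy data, invented for this illustration only.
songs = pd.DataFrame({
    "title": ["Song A", "Song B", "Song C"],
    "artist": ["Artist X", "Artist Y", "Artist X"],
    "popularity": [80, 65, 90],
})

# One row per title, one column per artist; each cell holds the popularity.
heatmap_data = songs.pivot(index="title", columns="artist", values="popularity")

print(heatmap_data)
print(heatmap_data.shape)  # prints (3, 2)
```

Cells with no matching title and artist pair come out as `NaN`. Note that `pivot` raises a `ValueError` if the same title and artist pair appears more than once.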


In [None]:
# Write your code here

## Problem 3: Missing Data (10 points)

This section outlines tasks related to introducing, identifying, and handling missing values in a DataFrame. The tasks are designed to evaluate your ability to manipulate and clean data effectively.

In [None]:
# Introduce missing values into the data
import numpy as np

def introduce_missing_values(df, missing_fraction=0.5):
    # Note: modifies df in place and returns the same object
    np.random.seed(42)  # fixed seed for reproducible results
    total_elements = df.size
    num_missing = int(total_elements * missing_fraction)
    # All (row, column) positions in the DataFrame
    indices = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
    # Pick num_missing distinct positions at random
    missing_indices = np.random.choice(len(indices), num_missing, replace=False)

    for i in missing_indices:
        row, col = indices[i]
        df.iat[row, col] = np.nan
    return df

Apply the function provided above to the `full_library` DataFrame that you created in Problem 1. Use the resulting DataFrame to perform the following tasks.
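
The helper writes `NaN` into the DataFrame it receives, so the caller's DataFrame is modified as well. The sketch below repeats the helper so that it runs on its own and applies it to a copy of a small toy DataFrame. The `toy` data and the variable names are invented for illustration.

```python
import numpy as np
import pandas as pd


def introduce_missing_values(df, missing_fraction=0.5):
    # Same logic as the helper provided above.
    np.random.seed(42)
    total_elements = df.size
    num_missing = int(total_elements * missing_fraction)
    indices = [(row, col) for row in range(df.shape[0]) for col in range(df.shape[1])]
    missing_indices = np.random.choice(len(indices), num_missing, replace=False)
    for i in missing_indices:
        row, col = indices[i]
        df.iat[row, col] = np.nan
    return df


# Toy data, invented for this illustration only.
toy = pd.DataFrame({
    "name": ["a", "b", "c"],
    "score": [1.5, 2.5, 3.5],
})

# Pass a copy so that the original DataFrame stays intact.
toy_missing = introduce_missing_values(toy.copy())

print(toy.isna().sum().sum())          # prints 0: the original is untouched
print(toy_missing.isna().sum().sum())  # prints 3: half of the 6 cells are NaN
```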

In [None]:
# Write your code here

1. **Introduce Missing Values**  
   Introduce missing values into your DataFrame by assigning `None` to some values in specific columns.

2. **Check for Missing Values**  
   Check if there are any missing values in your chosen column in the DataFrame.

3. **Check for Missing Values in the Entire DataFrame**  
   Check if there are any missing values present anywhere in the entire DataFrame.

4. **Count Missing Values per Column**  
   Count the number of missing values in each column of the DataFrame.

5. **Replace Missing Values with Mean**  
   Given the DataFrame, replace the missing values in the `popularity` column with the mean value of that column.

6. **Replace Missing Values with Median**  
   Using the same DataFrame, replace the missing values in the `duration` column with the median value of that column.

7. **Replace Missing Values with a Default String**  
   Given the DataFrame, replace all missing values in the `genre` column with the string `'Unknown'`.

8. **Drop Rows with Missing Values**  
   Identify and remove rows where the `artist` column contains missing values, and print the rows that were dropped as part of this operation.
   
9. **Backward Fill Missing Values**  
   Using the same DataFrame, backward-fill the missing values in the `year` column.

10. **Identifying Missing Values in Everyday Scenarios**  
    Provide three examples from your personal experiences or observations where missing or incomplete data occurs. Explain the context briefly and highlight how the absence of data impacts the situation.

In [None]:
# Write your code here

## Problem 4: Data Analysis (10 points) 


This section outlines a series of tasks designed to analyze, manipulate, and clean data in a DataFrame. The tasks focus on calculating statistics, grouping data, identifying outliers, generating random data, and cleaning text columns.


1. **Calculate Basic Statistics**  
   Compute basic statistics (mean, median, standard deviation, etc.) for the `duration` column. Use the NumPy, statistics, or Pandas library.

2. **Group Data and Calculate Max Popularity**  
   Group the data by the `genre` column and calculate the max `popularity` for each group.
   
3. **Calculate Mean, Count, and Sum by Genre**  
    For each unique genre in the `genre` column, calculate the mean, count, and sum of the `duration` values.

4. **Create a Boxplot for Outlier Visualization**  
   Generate a boxplot for the `duration` column to visually identify outliers. Write the data points that are classified as outliers to a new DataFrame or output them for review.

5. **Create a Scatter Plot for Outlier Visualization**  
   Create a scatter plot of `year` vs. `popularity` to visually identify outliers. Write the data points that are classified as outliers to a new DataFrame or output them for review.
   
6. **Find and Remove Duplicate Rows**  
    Identify and print the number of duplicate rows in the DataFrame, then remove them to retain only unique entries.

7. **Find Unique Values in the Genre Column**  
   Extract and display all unique values in the `genre` column.

8. **Clean the Column**  
   Clean one of the object-type columns by converting all values to lowercase and removing any extra spaces.
   
9. **Replace a Value in a Column**  
   Select a column of your choice and replace one of its existing values with a new value of your choice.

10. **Clean the Title Column**  
    Remove any extra spaces or symbols (e.g., `!`, `?`, `.`) from the `title` column.
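
Tasks 4 and 5 do not fix a definition of "outlier". One common convention, assumed here, is the 1.5 × IQR rule, which matches the default whisker length of Matplotlib box plots. The sketch below applies that rule to a toy Series; the values and variable names are invented for illustration.

```python
import pandas as pd

# Toy values, invented for this illustration only.
values = pd.Series([10, 12, 11, 13, 12, 11, 50], name="value")

# Interquartile range: distance between the 25th and 75th percentiles.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Assumed rule: points more than 1.5 * IQR beyond the quartiles are outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())  # prints [50]
```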

In [None]:
# Write your code here