<div style="border:solid green 2px; padding: 20px"> <h1 style="color:green; margin-bottom:20px">Reviewer's comment v1</h1>

Hello Justin, my name is Dmitrii. I'm going to review your project! Nice to meet you! 🙌

You can find my comments under the heading **«Review»**. I will categorize my comments in green, blue or red boxes like this:

<div class="alert alert-success">
    <b>Success:</b> if everything is done successfully
</div>
<div class="alert alert-warning">
    <b>Remarks:</b> if I can give some recommendations or ways to improve the project
   
</div>
<div class="alert alert-danger">
    <b>Needs fixing:</b> if the block requires some corrections. Work can't be accepted with the red comments
</div>

Please don't remove my comments :) If you have any questions don't hesitate to respond to my comments in a different section. 
<div class="alert alert-info"> <b>Student comments:</b> For example like this</div>   
   

<div style="border:solid green 2px; padding: 20px">
<b>Reviewer's comment v1:</b>
    
<b>Overall Feedback</b> 
    
- Overall well done! I can see that a lot of effort has been made! Your project looks very good and you accomplished impressive results.
- However, there are some comments/areas left to fix (in red boxes with the title - Reviewer's comment v1:). 
    
And of course, if you have any questions along the way, remember that you can always reach out to your tutor for any clarification.
    
I will wait for you to send me a new version of the project :)

    
</div>

<div style="border:solid green 2px; padding: 20px">
<b>Reviewer's comment v2:</b>
    
<b>Overall Feedback</b> 

Thank you for making all the improvements to your project! It looks perfect now and there are no critical issues. 
    
Please keep up the great work and don't hesitate to use this project as a reference in your future sprints. 
   
Good luck on the next project 🍀 
</div>

## Basic Python - Project <a id='intro'></a>

## Introduction <a id='intro'></a>
In this project, you will work with data from the entertainment industry. You will study a dataset with records on movies and shows. The research will focus on the "Golden Age" of television, which began in 1999 with the release of *The Sopranos* and is still ongoing.

The aim of this project is to investigate how the number of votes a title receives impacts its ratings. The assumption is that highly-rated shows (we will focus on TV shows, ignoring movies) released during the "Golden Age" of television also have the most votes.

### Stages 
Data on movies and shows is stored in the `/datasets/movies_and_shows.csv` file. There is no information about the quality of the data, so you will need to explore it before doing the analysis.

First, you'll evaluate the quality of the data and see whether its issues are significant. Then, during data preprocessing, you will try to account for the most critical problems.
 
Your project will consist of three stages:
 1. Data overview
 2. Data preprocessing
 3. Data analysis

<div class="alert alert-success">
<b>Reviewer's comment v1:</b>
    
It is always helpful for the reader to have additional information about project tasks. It gives an overview of what you are going to achieve in this project.


## Stage 1. Data overview <a id='data_review'></a>

Open and explore the data.

You'll need `pandas`, so import it.

In [1]:
import pandas as pd

<div class="alert alert-success">
<b>Reviewer's comment v1:</b>
    
Well done! The required library has been imported.

Read the `movies_and_shows.csv` file from the `datasets` folder and save it in the `df` variable:

In [2]:
file_path = "/datasets/movies_and_shows.csv"
df = pd.read_csv(file_path)

<div class="alert alert-block alert-warning">
<b>Reviewer's comment v1</b>
 
Everything is correct here; however, it's good practice to use `try/except` blocks when performing file operations or other tasks that might fail for external reasons, such as the file not being present, issues with file permissions, or incorrect file formats. This way, you can handle errors gracefully and provide a more user-friendly error message, rather than having the program crash unexpectedly.

Here's how you can implement it:

```
try:
    orders = pd.read_csv(local_path['orders'], sep=';')

except FileNotFoundError:
    orders = pd.read_csv(server_path['orders'], sep=';')
```
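
Adapted to a single dataset, the same idea could look like the sketch below. The helper name `read_csv_with_fallback` and the temporary demo file are assumptions made for illustration only; in this project the first candidate would be `/datasets/movies_and_shows.csv`, and any second location is up to you.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(paths):
    """Return the first CSV in `paths` that exists; raise if none can be found."""
    for path in paths:
        try:
            return pd.read_csv(path)
        except FileNotFoundError:
            continue  # try the next candidate location
    raise FileNotFoundError(f"None of these files were found: {paths}")


# Demo: a temporary file stands in for the real dataset (hypothetical data)
with tempfile.TemporaryDirectory() as tmp_dir:
    existing = os.path.join(tmp_dir, "movies_and_shows.csv")
    missing = os.path.join(tmp_dir, "does_not_exist.csv")
    pd.DataFrame({"TITLE": ["Taxi Driver"], "Type": ["MOVIE"]}).to_csv(existing, index=False)
    demo_df = read_csv_with_fallback([missing, existing])

print(demo_df.shape)
```

Listing the candidate paths in one place keeps the loading cell short and makes the error message explicit when no file is available.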

Print the first 10 table rows:

In [3]:
print(df.head(10))


              name                      Character   r0le        TITLE   Type  \
0   Robert De Niro                  Travis Bickle  ACTOR  Taxi Driver  MOVIE   
1     Jodie Foster                  Iris Steensma  ACTOR  Taxi Driver  MOVIE   
2    Albert Brooks                            Tom  ACTOR  Taxi Driver  MOVIE   
3    Harvey Keitel        Matthew 'Sport' Higgins  ACTOR  Taxi Driver  MOVIE   
4  Cybill Shepherd                          Betsy  ACTOR  Taxi Driver  MOVIE   
5      Peter Boyle                         Wizard  ACTOR  Taxi Driver  MOVIE   
6   Leonard Harris      Senator Charles Palantine  ACTOR  Taxi Driver  MOVIE   
7   Diahnne Abbott                Concession Girl  ACTOR  Taxi Driver  MOVIE   
8      Gino Ardito             Policeman at Rally  ACTOR  Taxi Driver  MOVIE   
9  Martin Scorsese  Passenger Watching Silhouette  ACTOR  Taxi Driver  MOVIE   

   release Year              genres  imdb sc0re  imdb v0tes  
0          1976  ['drama', 'crime']         8.2    808582

Obtain the general information about the table with one command:

In [4]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 85579 entries, 0 to 85578
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0      name       85579 non-null  object 
 1   Character     85579 non-null  object 
 2   r0le          85579 non-null  object 
 3   TITLE         85578 non-null  object 
 4     Type        85579 non-null  object 
 5   release Year  85579 non-null  int64  
 6   genres        85579 non-null  object 
 7   imdb sc0re    80970 non-null  float64
 8   imdb v0tes    80853 non-null  float64
dtypes: float64(2), int64(1), object(6)
memory usage: 5.9+ MB
None


The table contains nine columns. The majority store the same data type: object. The only exceptions are `'release Year'` (int64 type), `'imdb sc0re'` (float64 type) and `'imdb v0tes'` (float64 type). Scores and votes will be used in our analysis, so it's important to verify that they are present in the dataframe in the appropriate numeric format. Three columns (`'TITLE'`, `'imdb sc0re'` and `'imdb v0tes'`) have missing values.

According to the documentation:
- `'name'` — actor/director's name and last name
- `'Character'` — character played (for actors)
- `'r0le '` — the person's contribution to the title (it can be in the capacity of either actor or director)
- `'TITLE '` — title of the movie (show)
- `'  Type'` — show or movie
- `'release Year'` — year when movie (show) was released
- `'genres'` — list of genres under which the movie (show) falls
- `'imdb sc0re'` — score on IMDb
- `'imdb v0tes'` — votes on IMDb

We can see three issues with the column names:
1. Some names are uppercase, while others are lowercase.
2. There are names containing whitespace.
3. A few column names have digit '0' instead of letter 'o'. 
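
The three issues listed above can also be detected programmatically. The sketch below is only an illustration: the helper `describe_name_issues` is a hypothetical name, and the exact amount of leading whitespace in the raw column names is an assumption based on the `df.info()` output.

```python
# Column names as they appear in the df.info() output
# (the exact amount of leading whitespace is an assumption)
raw_columns = ['   name', 'Character', 'r0le', 'TITLE', '  Type',
               'release Year', 'genres', 'imdb sc0re', 'imdb v0tes']


def describe_name_issues(name):
    """Return the list of style issues found in one column name."""
    issues = []
    if name != name.lower():
        issues.append('uppercase')
    if ' ' in name:
        issues.append('whitespace')
    if '0' in name:
        issues.append('zero for o')
    return issues


name_issues = {name: describe_name_issues(name) for name in raw_columns}
for name, issues in name_issues.items():
    # repr() makes leading and trailing whitespace visible
    print(repr(name), '->', issues or 'ok')
```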


### Conclusions <a id='data_review_conclusions'></a> 

Each row in the table stores data about a movie or show. The columns can be divided into two categories: the first is about the roles held by different people who worked on the movie or show (role, name of the actor or director, and character if the row is about an actor); the second category is information about the movie or show itself (title, release year, genre, imdb figures).

It's clear that there is sufficient data to do the analysis and evaluate our assumption. However, to move forward, we need to preprocess the data.

<div class="alert alert-success">
<b>Reviewer's comment v1:</b>
    
Great data overview and correct conclusions. 

## Stage 2. Data preprocessing <a id='data_preprocessing'></a>
Correct the formatting in the column headers and deal with the missing values. Then, check whether there are duplicates in the data.

In [5]:
# Correct formatting in column headers
df.rename(columns=lambda x: x.strip().replace(' ', '_').lower(), inplace=True)

# Deal with missing values
df.dropna(inplace=True)  # Drop rows with any missing values

# Check for duplicates
duplicates = df.duplicated()
num_duplicates = duplicates.sum()

if num_duplicates > 0:
    print("There are duplicates in the data.")
    print(df[duplicates])  # Display duplicate rows
else:
    print("There are no duplicates in the data.")


There are duplicates in the data.
                      name                  character      r0le  \
7561         Philip Greene  Baseball Fan (uncredited)     ACTOR   
14512             Dan Levy                   Reporter     ACTOR   
18952     Nicolas Le Nev??                    unknown  DIRECTOR   
22456    John Iii Franklin                    Himself     ACTOR   
29557         Claudio Roca                   Nicol?­s     ACTOR   
...                    ...                        ...       ...   
85569       Jessica Cediel            Liliana Navarro     ACTOR   
85570  Javier Gardeaz?­bal   Agust??n "Peluca" Ort??z     ACTOR   
85571        Carla Giraldo             Valery Reinoso     ACTOR   
85572  Ana Mar??a S?­nchez                    Lourdes     ACTOR   
85577         Isabel Gaona                     Cacica     ACTOR   

                                 title   type  release_year  \
7561                   How Do You Know  MOVIE          2010   
14512  A Very Harold & Kumar Christ

Change the column names according to the rules of good style:
* If the name has several words, use snake_case
* All characters must be lowercase
* Remove whitespace
* Replace zero with letter 'o'

In [6]:

# Define a function to convert column names to snake_case and replace '0' with 'o'
def convert_to_snake_case(column_name):
    return column_name.strip().replace(' ', '_').replace('0', 'o').lower()

# Rename columns using the defined function
df.rename(columns=convert_to_snake_case, inplace=True)

Check the result. Print the names of the columns once more:

In [7]:
# checking result: the list of column names
print(df.columns)

Index(['name', 'character', 'role', 'title', 'type', 'release_year', 'genres',
       'imdb_score', 'imdb_votes'],
      dtype='object')


<div class="alert alert-success">
<b>Reviewer's comment v1:</b>
    
It is great that you correctly applied the `replace` method. As a second approach, the `rename` method could be used. 

### Missing values <a id='missing_values'></a>
First, find the number of missing values in the table. To do so, combine two `pandas` methods:

In [8]:
missing_values = df.isna().sum()

Not all missing values affect the research: the single missing value in `'title'` is not critical. The missing values in columns `'imdb_score'` and `'imdb_votes'` represent around 6% of all records (4,609 and 4,726, respectively, of the total 85,579). This could potentially affect our research. To avoid this issue, we will drop rows with missing values in the `'imdb_score'` and `'imdb_votes'` columns.

In [9]:
df.dropna(subset=['imdb_score', 'imdb_votes'], inplace=True)

In [10]:
missing_values_after_drop = df.isna().sum()
print("Number of missing values after dropping rows:", missing_values_after_drop)

Number of missing values after dropping rows: name            0
character       0
role            0
title           0
type            0
release_year    0
genres          0
imdb_score      0
imdb_votes      0
dtype: int64


<div class="alert alert-success">
<b>Reviewer's comment v1</b>
    
Great that you selected `isna()` method to find missing values! 

It is also sometimes helpful to check not only the total amount of missing values in each column but also look at the percentage of missing values. It helps to understand the overall impact. You can check percentage using, for example, this code:

`df.isnull().sum()/len(df)`
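
Note that the expression above returns fractions between 0 and 1. A minimal sketch of turning them into percentages is shown below; the toy dataframe `toy_df` is made up for illustration and is not the project data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing values
toy_df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'imdb_score': [8.2, np.nan, 7.1, 6.5],
    'imdb_votes': [1200.0, np.nan, np.nan, 300.0],
})

# isna() yields booleans; their mean is the fraction of missing values per column
missing_pct = (toy_df.isna().mean() * 100).round(1)
print(missing_pct)
```

Taking the mean of the boolean mask is equivalent to dividing the sum by `len(df)`, so both forms give the same numbers.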

In [11]:
missing_values_per_column = df.isnull().sum()

# Sum the per-column counts to get the total number of missing values
total_missing_values = missing_values_per_column.sum()
# Check if there are any missing values left in the DataFrame
if total_missing_values == 0:
    print("There are no missing values in the DataFrame after dropping rows.")
else:
    print("There are still missing values in the DataFrame after dropping rows.")

There are no missing values in the DataFrame after dropping rows.


<div class="alert alert-success">
<b>Reviewer's comment v1:</b>
    
Everything is correct here!

### Duplicates <a id='duplicates'></a>
Find the number of duplicate rows in the table using one command:

In [12]:
num_duplicate_rows = df.duplicated().sum()

Review the duplicate rows to determine if removing them would distort our dataset.

In [13]:
# Find duplicate rows
duplicate_rows = df[df.duplicated()]

# Print duplicate rows
print("Duplicate Rows:")
print(duplicate_rows)

Duplicate Rows:
                      name                  character      role  \
7561         Philip Greene  Baseball Fan (uncredited)     ACTOR   
14512             Dan Levy                   Reporter     ACTOR   
18952     Nicolas Le Nev??                    unknown  DIRECTOR   
22456    John Iii Franklin                    Himself     ACTOR   
29557         Claudio Roca                   Nicol?­s     ACTOR   
...                    ...                        ...       ...   
85569       Jessica Cediel            Liliana Navarro     ACTOR   
85570  Javier Gardeaz?­bal   Agust??n "Peluca" Ort??z     ACTOR   
85571        Carla Giraldo             Valery Reinoso     ACTOR   
85572  Ana Mar??a S?­nchez                    Lourdes     ACTOR   
85577         Isabel Gaona                     Cacica     ACTOR   

                                 title   type  release_year  \
7561                   How Do You Know  MOVIE          2010   
14512  A Very Harold & Kumar Christmas  MOVIE        

<div class="alert alert-warning">
<b>Reviewer's comment v1</b>

I noticed you've been using the `print` function to view dataframes. While `print` does the job of showing the results, Jupyter Notebooks offer a more powerful and visually appealing option through the `display` function.

Using `display` instead of `print` has several benefits, especially for displaying pandas DataFrames:

- Improved Readability: `display` renders a DataFrame as a nicely formatted HTML table that is easier to read and interpret than the plain-text output of `print`.
- Better Formatting: With `display`, the output takes advantage of HTML styling, which means your data can be presented with better spacing, alignment, and even coloring for different data types.
- Interactivity: Jupyter Notebooks can integrate with tools like `IPython.display` to provide interactive features for displaying complex objects, images, and even interactive widgets alongside DataFrames.
    
For example, instead of using `print(duplicate_rows)`, you can simply call `display(duplicate_rows)` to see a more readable and visually appealing table of your data.

There are two clear duplicates in the printed rows. We can safely remove them.
Call the `pandas` method for getting rid of duplicate rows:

In [14]:
# Remove duplicate rows
df.drop_duplicates(inplace=True)

<div class="alert alert-warning">
<b>Reviewer's comment v1</b>
    
Note that you also need to reset the index of your dataframe after removing rows. To do that, you can use `reset_index()`.
    
Without it, the index is no longer consecutive, because `drop_duplicates()` deleted some rows, which makes the dataframe harder to work with.
    
    
You can read about it here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.reset_index.html
    
The parameter `drop=True` is helpful here: it replaces the existing index instead of keeping the old one as an extra column.
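
A minimal sketch of the effect, using a made-up toy dataframe (`toy_df` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical toy data with two duplicated rows
toy_df = pd.DataFrame({
    'name': ['Ann', 'Bob', 'Bob', 'Cid', 'Cid'],
    'title': ['X', 'Y', 'Y', 'Z', 'Z'],
})

deduped = toy_df.drop_duplicates()
print(deduped.index.tolist())  # the index keeps the original labels, so it has gaps

# drop=True discards the old index instead of adding it as a new column
deduped = deduped.reset_index(drop=True)
print(deduped.index.tolist())  # consecutive labels starting at 0
```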

Check for duplicate rows once more to make sure you have removed all of them:

In [15]:
# Check for duplicate rows after removal
duplicate_rows_after_removal = df[df.duplicated()]

# Print duplicate rows (if any)
if duplicate_rows_after_removal.empty:
    print("No duplicate rows found after removal.")
else:
    print("Duplicate Rows After Removal:")
    print(duplicate_rows_after_removal)

No duplicate rows found after removal.


Now get rid of implicit duplicates in the `'type'` column. For example, the string `'SHOW'` can be written in different ways. These kinds of errors will also affect the result.

Print a list of unique `'type'` names, sorted in alphabetical order. To do so:
* Retrieve the intended dataframe column 
* Apply a sorting method to it
* For the sorted column, call the method that will return all unique column values

In [16]:
# Retrieve the 'type' column
type_column = df['type']

# Sort the 'type' column alphabetically
sorted_types = type_column.sort_values()

# Get the unique values
unique_types = sorted_types.unique()

# Print the unique 'type' names
for t in unique_types:
    print(t)

MOVIE
SHOW
movies
shows
the movie
tv
tv series
tv show
tv shows


Look through the list to find implicit duplicates of `'show'` (`'movie'` duplicates will be ignored since the assumption is about shows). These could be names written incorrectly or alternative names of the same genre.

You will see the following implicit duplicates:
* `'shows'`
* `'SHOW'`
* `'tv show'`
* `'tv shows'`
* `'tv series'`
* `'tv'`

To get rid of them, declare the function `replace_wrong_show()` with two parameters: 
* `wrong_shows_list=` — the list of duplicates
* `correct_show=` — the string with the correct value

The function should correct the names in the `'type'` column from the `df` table (i.e., replace each value from the `wrong_shows_list` list with the value in `correct_show`).

In [17]:
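def replace_wrong_show(wrong_shows_list, correct_show):
    for show in wrong_shows_list:
        # Assign the result back to the column: calling replace(..., inplace=True)
        # on df['type'] may act on a temporary copy in recent pandas versions
        df['type'] = df['type'].replace(show, correct_show)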
# REMOVED 'MOVIE' FROM SHOW NAMES        
# Given list of incorrect 'show' names
wrong_shows_list = ['SHOW', 'shows', 'tv', 'tv series', 'tv show', 'tv shows']
# Correct name for the shows
correct_show = 'SHOW'
# Call the function with the lists defined above
replace_wrong_show(wrong_shows_list, correct_show)

Call `replace_wrong_show()` and pass it arguments so that it clears implicit duplicates and replaces them with `SHOW`:

In [18]:
replace_wrong_show(['shows', 'SHOW', 'tv show', 'tv shows', 'tv series', 'tv'], 'SHOW')


Make sure the duplicate names are removed. Print the list of unique values from the `'type'` column:

In [19]:
unique_types = df['type'].drop_duplicates().tolist()
print(unique_types)


['MOVIE', 'the movie', 'SHOW', 'movies']


<div class="alert alert-danger">
<b>Reviewer's comment v1</b>
    
`movie` duplicates should be ignored since the assumption is about shows. 
    
Could you please update that? 

<div class="alert alert-success">
<b>Reviewer's comment v2</b>
    
Well done! 

### Conclusions <a id='data_preprocessing_conclusions'></a>
We detected three issues with the data:

- Incorrect header styles
- Missing values
- Duplicate rows and implicit duplicates

The headers have been cleaned up to make processing the table simpler.

All rows with missing values have been removed. 

The absence of duplicates will make the results more precise and easier to understand.

Now we can move on to our analysis of the prepared data.

## Stage 3. Data analysis <a id='hypotheses'></a>

Based on the previous project stages, you can now define how the assumption will be checked. Calculate the average number of votes for each score (this data is available in the `imdb_score` and `imdb_votes` columns), and then check how these averages relate to each other. If the averages for shows with the highest scores are higher than those for shows with lower scores, the assumption appears to be true.

Based on this, complete the following steps:

- Filter the dataframe to only include shows released in 1999 or later.
- Group scores into buckets by rounding the values of the appropriate column (a set of 1-10 integers will help us make the outcome of our calculations more evident without damaging the quality of our research).
- Identify outliers among scores based on their number of votes, and exclude scores with few votes.
- Calculate the average votes for each score and check whether the assumption matches the results.
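
The steps above can be sketched end to end on a small made-up table. Everything in this block is illustrative: the toy rows, the variable names, and the `min_titles` threshold are assumptions, not values from the project data. Each step works on the result of the previous one, so the final averages only cover shows released in 1999 or later.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'type': ['SHOW', 'SHOW', 'MOVIE', 'SHOW', 'SHOW', 'SHOW', 'SHOW'],
    'release_year': [2005, 2010, 2012, 1990, 2015, 2018, 2020],
    'imdb_score': [8.2, 7.9, 9.0, 8.8, 6.1, 5.8, 9.6],
    'imdb_votes': [1000.0, 3000.0, 50000.0, 70000.0, 400.0, 200.0, 10.0],
})

# 1. Keep only shows released in 1999 or later
shows = toy_df[(toy_df['release_year'] >= 1999) & (toy_df['type'] == 'SHOW')].copy()

# 2. Group scores into integer buckets
shows['rounded_score'] = shows['imdb_score'].round()

# 3. Count rows per bucket and keep only buckets with enough data
min_titles = 2  # assumed threshold for this toy example
counts = shows.groupby('rounded_score')['imdb_votes'].count()
kept_scores = counts[counts >= min_titles].index

# 4. Average votes per remaining bucket
avg_votes = (
    shows[shows['rounded_score'].isin(kept_scores)]
    .groupby('rounded_score')['imdb_votes']
    .mean()
)
print(avg_votes)
```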

To filter the dataframe and only include shows released in 1999 or later, you will take two steps. First, keep only titles published in 1999 or later in our dataframe. Then, filter the table to only contain shows (movies will be removed).

In [20]:
# Filter the dataframe to only include shows released in 1999 or later
shows_1999_or_later = df[(df['release_year'] >= 1999) & (df['type'] == 'SHOW')]

# Check the first few rows to verify the filtering
print(shows_1999_or_later.head())

                  name                character   role      title  type  \
1664       Jeff Probst           Himself - Host  ACTOR   Survivor  SHOW   
2076     Mayumi Tanaka  Monkey D. Luffy (voice)  ACTOR  One Piece  SHOW   
2077      Kazuya Nakai     Roronoa Zoro (voice)  ACTOR  One Piece  SHOW   
2078     Akemi Okamura             Nami (voice)  ACTOR  One Piece  SHOW   
2079  Kappei Yamaguchi            Usopp (voice)  ACTOR  One Piece  SHOW   

      release_year                                             genres  \
1664          2000                                        ['reality']   
2076          1999  ['animation', 'action', 'comedy', 'drama', 'fa...   
2077          1999  ['animation', 'action', 'comedy', 'drama', 'fa...   
2078          1999  ['animation', 'action', 'comedy', 'drama', 'fa...   
2079          1999  ['animation', 'action', 'comedy', 'drama', 'fa...   

      imdb_score  imdb_votes  
1664         7.4     24687.0  
2076         8.8    117129.0  
2077         8.8 

<div class="alert alert-success">
<b>Reviewer's comment v1</b>

You've filtered the required data correctly. 
    
    
Additionally, there are other methods that can help you filter your data, for example `query()`:

`df_filtered = df.query('type == "SHOW"')`
    
You can read more about them here: 
https://towardsdatascience.com/10-examples-that-will-make-you-use-pandas-query-function-more-often-a8fb3e9361cb
https://towardsdatascience.com/how-to-use-loc-and-iloc-for-selecting-data-in-pandas-bd09cb4c3d79    
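
As a small sketch of how `query()` can combine both conditions, consider the made-up dataframe below (`toy_df` and `min_year` are illustrative names). A Python variable can be referenced inside the query string with the `@` prefix.

```python
import pandas as pd

# Hypothetical toy data
toy_df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'type': ['SHOW', 'MOVIE', 'SHOW', 'SHOW'],
    'release_year': [2001, 2005, 1995, 1999],
})

min_year = 1999  # referenced inside the query string as @min_year
filtered = toy_df.query('type == "SHOW" and release_year >= @min_year')
print(filtered)
```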

In [21]:
# Filter the dataframe to only include shows (movies are removed)
shows_only = df[df['type'] == 'SHOW']

# Check the first few rows to verify the filtering
print(shows_only.head())


               name             character   role  \
163  Graham Chapman               Various  ACTOR   
164   Michael Palin  Various / "It's" man  ACTOR   
165     Terry Jones               Various  ACTOR   
166       Eric Idle               Various  ACTOR   
167   Terry Gilliam               Various  ACTOR   

                            title  type  release_year                  genres  \
163  Monty Python's Flying Circus  SHOW          1969  ['comedy', 'european']   
164  Monty Python's Flying Circus  SHOW          1969  ['comedy', 'european']   
165  Monty Python's Flying Circus  SHOW          1969  ['comedy', 'european']   
166  Monty Python's Flying Circus  SHOW          1969  ['comedy', 'european']   
167  Monty Python's Flying Circus  SHOW          1969  ['comedy', 'european']   

     imdb_score  imdb_votes  
163         8.8     73424.0  
164         8.8     73424.0  
165         8.8     73424.0  
166         8.8     73424.0  
167         8.8     73424.0  


The scores that are to be grouped should be rounded. For instance, titles with scores like 7.8, 8.1, and 8.3 will all be placed in the same bucket with a score of 8.
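
One detail worth knowing when bucketing this way: `Series.round()` rounds exact halves to the nearest even integer, so a score of 6.5 lands in bucket 6 while 7.5 lands in bucket 8. The sketch below illustrates this on a few made-up scores.

```python
import pandas as pd

# Hypothetical scores, including two exact halves
scores = pd.Series([7.8, 8.1, 8.3, 6.5, 7.5])

rounded = scores.round()
print(rounded.tolist())
```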

In [22]:
# Round the 'imdb_score' column
df['rounded_score'] = df['imdb_score'].round()

# Check the outcome with tail()
print(df[['imdb_score', 'rounded_score']].tail())

       imdb_score  rounded_score
85573         3.8            4.0
85574         3.8            4.0
85575         3.8            4.0
85576         3.8            4.0
85578         3.8            4.0


It is now time to identify outliers based on the number of votes.

In [23]:
# Group the data by rounded scores and count unique values of imdb_votes in each group
outliers = df.groupby('rounded_score')['imdb_votes'].nunique()

# Print the result
print(outliers)

rounded_score
2.0       21
3.0       51
4.0      215
5.0      539
6.0     1366
7.0     1435
8.0      920
9.0       85
10.0       1
Name: imdb_votes, dtype: int64


Based on the aggregation performed, it is evident that scores 2 (21 distinct vote counts), 3 (51 distinct vote counts), and 10 (only 1 distinct vote count) are outliers. There isn't enough data for these scores for the average number of votes to be meaningful.

To obtain the mean numbers of votes for the selected scores (we identified a range of 4-9 as acceptable), use conditional filtering and grouping.

In [24]:
# Filter the DataFrame to include scores in the range 4-9
filtered_df = df[(df['rounded_score'] >= 4) & (df['rounded_score'] <= 9)]
# Used rounded_score column
# Group the filtered DataFrame by 'rounded_score' and calculate the mean number of votes
average_votes_per_score = filtered_df.groupby('rounded_score')['imdb_votes'].mean().reset_index()

# Print the result
print(average_votes_per_score)


   rounded_score     imdb_votes
0            4.0    8943.007876
1            5.0   14421.066403
2            6.0   29552.217666
3            7.0   54061.490793
4            8.0  162813.093748
5            9.0  577107.221778


<div class="alert alert-danger">
<b>Reviewer's comment v1</b>
    
Here and below `rounded_score` column should be used. 

<div class="alert alert-success">
<b>Reviewer's comment v2</b>
    
Thank you for updating that. Everything is correct now. 

Now for the final step! Round the column with the averages, rename both columns, and print the dataframe in descending order.

In [25]:
# Round the column with averages
average_votes_per_score['imdb_votes'] = average_votes_per_score['imdb_votes'].round()
# Used rounded_score column
# Rename columns
average_votes_per_score.columns = ['rounded_score', 'average_votes']

# Print the DataFrame in descending order
print(average_votes_per_score.sort_values(by='average_votes', ascending=False))


   rounded_score  average_votes
5            9.0       577107.0
4            8.0       162813.0
3            7.0        54061.0
2            6.0        29552.0
1            5.0        14421.0
0            4.0         8943.0


The assumption matches the analysis: the shows with the top 3 scores have the most votes.

## Conclusion <a id='hypotheses'></a>

The research done confirms that highly-rated shows released during the "Golden Age" of television also have the most votes. The average number of votes rises with every score bucket, from about 8,943 at score 4 to about 577,107 at score 9, so the top three (scores 7-9) have the largest numbers. The data studied represents around 94% of the original set, so we can be confident in our findings.

<div class="alert alert-success">
<b>Reviewer's comment v1:</b>
 
Overall, you did great research. I only left some feedback in the notebook.
    
A small tip regarding the overall conclusion of the project: 
    It represents the overall work progress that you achieved. On a real project, this is probably the only thing the business will read. Therefore, it is crucial to present, in a structured way, all the conclusions that you made at each step of the project.

For example:

- Replaced missing values in the following data with the following method.
- Replaced data types in the following columns.
- etc.
- We observe that ... factors impact ... 
- My analysis shows ...
- I can recommend the following next steps / activities ...

It is also important to provide explanations and interpretations that will be interesting for business based on your analysis.

</div>