# Chocolate Bar Ratings Dataset

In this assignment, you will use a dataset containing the ratings of chocolate bars produced by companies located in different parts of the world. It is a pre-processed version of the original dataset, which is available at this [link](https://www.kaggle.com/rtatman/chocolate-bar-ratings).

Your focus will be on the ratings of chocolate bars produced in the UK and Switzerland. The ratings are in the range of 1-5; the higher the better.

### Importing Libraries

Before proceeding to the questions, ensure that you run the code cell to import the necessary libraries.

In [1]:
# Import the usual suspects for data manipulation, statistics, and visualisations.
import pandas as pd # used to "tidy" up and manipulate our data
import numpy as np # used for matrix and numerical calculations; the foundation of pandas
from scipy import stats # contains statistical functions and probability distributions
import matplotlib.pyplot as plt # used for visualisations
import seaborn as sns # a more user-friendly library used for visualisations

### Dataset and variables

**1. Load the dataset `flavors_cacao.csv` into a DataFrame called `choco_df` using the file path `'data/flavors_cacao.csv'` and display the first 5 rows of the DataFrame using the `.head()` method.**

See below code syntax for some guidance:
```python
choco_df = pd.read_csv(<file_path>)
choco_df.head()
```

In [2]:
#add your code below
#choco_df = ...

choco_df = pd.read_csv('data/flavors_cacao.csv')
choco_df.head()

**2. Using the `.loc` indexer, select the column named `rating` from the DataFrame `choco_df`, keeping only the rows where the value in the column `company_location` is equal to `'U.K.'`. Store the resulting data in a variable called `uk_ratings`.**

You can use `choco_df['company_location'] == 'U.K.'` as the filtering criteria while filtering rows.

See below code syntax for some guidance:
```python
uk_ratings = choco_df.loc[<filtering_criteria>,<column_name>]
uk_ratings
```
Your answer should be a Pandas Series.

In [3]:
#add your code below
#uk_ratings = ...

filter_uk = choco_df['company_location'] == 'U.K.'

uk_ratings = choco_df.loc[filter_uk,'rating']
uk_ratings
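
As a minimal sketch of how a boolean mask combines with `.loc`, the example below uses a small made-up DataFrame. The name `toy_df` and its values are hypothetical and only mirror the column names used in this assignment; they are not taken from `flavors_cacao.csv`.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror those used in the assignment.
toy_df = pd.DataFrame({
    "company_location": ["U.K.", "France", "U.K.", "Switzerland"],
    "rating": [3.5, 2.75, 3.0, 4.0],
})

# Boolean mask: True for every row whose location is 'U.K.'.
is_uk = toy_df["company_location"] == "U.K."

# .loc[rows, column] keeps the masked rows and returns a single column as a Series.
toy_uk_ratings = toy_df.loc[is_uk, "rating"]

print(type(toy_uk_ratings).__name__)  # Series
print(toy_uk_ratings.tolist())        # [3.5, 3.0]
```

Note that the resulting Series keeps the original row labels (here 0 and 2), which is why the index of `uk_ratings` will not necessarily start at 0 or be consecutive.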


**3. Using the `.loc` indexer, select the column named `rating` from the DataFrame `choco_df`, keeping only the rows where the value in the column `company_location` is equal to `'Switzerland'`. Store the resulting data in a variable called `swiss_ratings`.**

You can use `choco_df['company_location'] == 'Switzerland'` as the filtering criteria while filtering rows.

See below code syntax for some guidance:
```python
swiss_ratings = choco_df.loc[<filtering_criteria>,<column_name>]
swiss_ratings
```
Your answer should be a Pandas Series.

In [4]:
#add your code below
#swiss_ratings = ...

filter_swiss = choco_df['company_location'] == 'Switzerland'
swiss_ratings = choco_df.loc[filter_swiss,'rating']
swiss_ratings


**4. How many rows are in `uk_ratings`?** 

To determine the number of rows in the `uk_ratings` Pandas Series, you can use the `.shape[0]` attribute or `len()` function.

Store your answer in a variable called `uk_rows`

In [5]:
#add your code below
#uk_rows = ...

uk_rows = uk_ratings.shape[0]
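
The sketch below compares the two row-counting options on a hypothetical Series called `toy_ratings`. It also shows `.count()`, which differs from both when values are missing. Whether the assignment data contains missing ratings is not stated here, so the `NaN` is purely illustrative.

```python
import numpy as np
import pandas as pd

# Hypothetical ratings with one missing value, for illustration only.
toy_ratings = pd.Series([3.5, np.nan, 3.0, 4.0])

n_shape = toy_ratings.shape[0]  # number of rows, missing values included
n_len = len(toy_ratings)        # same result as .shape[0]
n_valid = toy_ratings.count()   # number of non-missing values only

print(n_shape, n_len, n_valid)  # 4 4 3
```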

**5. What is the mean rating of the chocolate produced by companies in the UK?** 

Refer to the `uk_ratings` Pandas Series. To calculate the mean rating of the chocolate produced by companies in the UK, you can use the NumPy `np.mean()` function.

See below code syntax for some guidance:
```python
np.mean(uk_ratings)
```
Store your answer in a variable called `uk_mean_rating`

In [6]:
#add your code below
#uk_mean_rating = ...

uk_mean_rating = np.mean(uk_ratings)
uk_mean_rating

**6. What is the median rating of the chocolate produced by companies in the UK?** 

Refer to the `uk_ratings` Pandas Series. To calculate the median rating of the chocolate produced by companies in the UK, you can use the NumPy `np.median()` function.

See below code syntax for some guidance:
```python
np.median(uk_ratings)
```
Store your answer in a variable called `uk_median_rating`

In [7]:
#add your code below
#uk_median_rating = ...
uk_median_rating = np.median(uk_ratings)
uk_median_rating

**7. What is the Standard Error of the Mean (SEM) of the ratings of the chocolates produced by UK companies?** 

Refer to the `uk_ratings` Pandas Series. To calculate the Standard Error of the Mean (SEM) of the ratings of the chocolates produced by UK companies, you can use the `sem()` function from SciPy's `stats` module: `stats.sem()`.

See below code syntax for some guidance:
```python
stats.sem(uk_ratings)
```
Store your answer in a variable called `uk_ratings_sem`

In [8]:
#add your code below
#uk_ratings_sem = ...
uk_ratings_sem = stats.sem(uk_ratings)
uk_ratings_sem
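
To make the quantity concrete, the sketch below recomputes the SEM by hand as the sample standard deviation divided by the square root of the sample size, and checks it against `stats.sem()`. The array `toy_ratings` is hypothetical. The comparison relies on `stats.sem()` using `ddof=1` by default, so the manual version sets `ddof=1` explicitly.

```python
import numpy as np
from scipy import stats

# Hypothetical ratings, for illustration only.
toy_ratings = np.array([2.5, 3.0, 3.0, 3.5, 4.0])

n = len(toy_ratings)
sample_sd = np.std(toy_ratings, ddof=1)  # sample standard deviation (n - 1 in the denominator)
manual_sem = sample_sd / np.sqrt(n)      # SEM = s / sqrt(n)

scipy_sem = stats.sem(toy_ratings)       # uses ddof=1 by default

print(np.isclose(manual_sem, scipy_sem))  # True
```

Because the SEM shrinks with the square root of the sample size, two groups with similar spread but different numbers of rows will have different SEM values.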

**8. How many rows are in `swiss_ratings`?** 

To determine the number of rows in the `swiss_ratings` Pandas Series, you can use the `.shape[0]` attribute or `len()` function.

Store your answer in a variable called `swiss_rows`

In [9]:
#add your code below
#swiss_rows = ...
swiss_rows = swiss_ratings.shape[0]
swiss_rows

**9. What is the mean rating of the chocolate produced by Swiss companies?** 

Refer to the `swiss_ratings` Pandas Series. To calculate the mean rating of the chocolate produced by Swiss companies, you can use the NumPy `np.mean()` function.

See below code syntax for some guidance:
```python
np.mean(swiss_ratings)
```
Store your answer in a variable called `swiss_mean_rating`

In [10]:
#add your code below
#swiss_mean_rating = ...
swiss_mean_rating = np.mean(swiss_ratings)
swiss_mean_rating

**10. What is the median rating of the chocolate produced by Swiss companies?** 

Refer to the `swiss_ratings` Pandas Series. To calculate the median rating of the chocolate produced by Swiss companies, you can use the NumPy `np.median()` function.

See below code syntax for some guidance:
```python
np.median(swiss_ratings)
```
Store your answer in a variable called `swiss_median_rating`

In [11]:
#add your code below
#swiss_median_rating = ...
swiss_median_rating = np.median(swiss_ratings)
swiss_median_rating

**11. What is the Standard Error of the Mean (SEM) of the ratings of the chocolate produced by Swiss companies?** 

Refer to the `swiss_ratings` Pandas Series. To calculate the Standard Error of the Mean (SEM) of the ratings of the chocolates produced by Swiss companies, you can use the `sem()` function from SciPy's `stats` module: `stats.sem()`.

See below code syntax for some guidance:
```python
stats.sem(swiss_ratings)
```
Store your answer in a variable called `swiss_ratings_sem`

In [12]:
#add your code below
#swiss_ratings_sem = ...
swiss_ratings_sem = stats.sem(swiss_ratings)
swiss_ratings_sem

**12. Use box plots to compare ratings from the UK and Switzerland**

Refer to the `choco_df` DataFrame. Use `sns.boxplot()` to plot `rating` against `company_location` in order to compare national ratings.

* Ensure you save your plot to the variable called `ratings_comp`.
* Ensure the plot has the x-label 'Rating'
* Ensure the plot has the y-label 'Location'

See below code syntax for some guidance:
```python
ratings_comp = sns.boxplot(x=..., y=..., data=...)
ratings_comp.set(xlabel=..., ylabel=...)
```

In [13]:
#add your code below
# Box plot with ratings on the x-axis and locations on the y-axis, matching the required labels
ratings_comp = sns.boxplot(x='rating', y='company_location', data=choco_df)
ratings_comp.set(xlabel='Rating', ylabel='Location')
ratings_comp
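
A box plot is built around each group's quartiles and median. As a rough sketch, these numbers can be tabulated directly with pandas. The DataFrame `toy_df` and its values are hypothetical and only reuse the assignment's column names.

```python
import pandas as pd

# Hypothetical toy data with the same column names as choco_df.
toy_df = pd.DataFrame({
    "company_location": ["U.K."] * 5 + ["Switzerland"] * 5,
    "rating": [2.5, 3.0, 3.0, 3.5, 4.0, 3.0, 3.25, 3.5, 3.5, 3.75],
})

# Lower quartile, median and upper quartile of the ratings for each location.
box_stats = (
    toy_df.groupby("company_location")["rating"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
box_stats.columns = ["q1", "median", "q3"]

print(box_stats)
```

Reading the table next to the plot makes it easier to say which location has the higher median and which has the wider interquartile range.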

### Outlier detection

**Are there any outliers in our data that might be skewing our statistics?**

An outlier is an extreme value that lies outside the overall pattern of the data.

There are several different ways of determining outliers, depending on the nature of the data and the calculations that you are asked to carry out.

Here we will use the definition that an outlier is any value that is:
- either greater than $Q_{3} + 1.5(Q_{3}-Q_{1})$
- or less than $Q_{1} - 1.5(Q_{3}-Q_{1})$

where $Q_{3}$ is the upper quartile, and $Q_{1}$ is the lower quartile.

<blockquote> Note: This is the default criterion that Seaborn uses to mark outliers on box plots like the one you created in the previous question. </blockquote>
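
The definition above can be worked through on a small example. The sketch below uses a hypothetical array `toy_ratings` and NumPy's default (linear) percentile interpolation; the values are made up and are not taken from the chocolate dataset.

```python
import numpy as np

# Hypothetical ratings with one unusually low value.
toy_ratings = np.array([1.0, 3.0, 3.0, 3.25, 3.5, 3.5, 3.75, 4.0])

# Lower and upper quartiles, then the interquartile range (IQR).
q1, q3 = np.percentile(toy_ratings, [25, 75])
iqr = q3 - q1

# Fences from the definition: Q1 - 1.5*IQR and Q3 + 1.5*IQR.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers.
toy_outliers = toy_ratings[(toy_ratings < lower) | (toy_ratings > upper)]

print(lower, upper)           # 2.15625 4.40625
print(toy_outliers.tolist())  # [1.0]
```

Here only the rating of 1.0 falls below the lower fence, so it is the single value flagged.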

**13. Create a function called `find_outliers` that accepts a DataFrame as input. The function will determine outliers in the `rating` column of the DataFrame using the specified criteria.** 

**Upon execution, the function will return a list named `outliers` containing the identified outliers. It is assumed that the DataFrame has the same format as `choco_df` and includes a `rating` column.**

See below code syntax for some guidance:
```python
q1 = df['rating'].quantile(0.25)
q3 = df['rating'].quantile(0.75)
upper = q3 + 1.5*(q3-q1)
lower = q1 - 1.5*(q3-q1)
```

In [33]:
#add your code below
#def find_outliers(df):
def find_outliers(df):
    # Quartiles and interquartile range (IQR) of the rating column
    q1 = df['rating'].quantile(0.25)
    q3 = df['rating'].quantile(0.75)
    iqr = q3 - q1
    # Fences: ratings beyond these limits count as outliers
    upper = q3 + 1.5 * iqr
    lower = q1 - 1.5 * iqr
    # Keep only the rows whose rating lies outside the fences
    # credit: https://stackoverflow.com/a/43093390/227926
    filtered = df[(df['rating'] < lower) | (df['rating'] > upper)]
    outliers = filtered['rating'].to_list()
    return outliers

**14. Use your function `find_outliers` to determine the outliers in the `choco_df` data, first filtering the DataFrame so that it contains only ratings from the UK.**

Store your answer (the output of the `find_outliers` function) in a variable called `uk_outliers`

See below code syntax for some guidance:
```python
uk_df = choco_df[choco_df['company_location']=='U.K.']
uk_outliers = find_outliers(uk_df)
uk_outliers
```

In [34]:
#add your code below
#uk_outliers = ...
uk_df = choco_df[choco_df['company_location']=='U.K.']
uk_outliers = find_outliers(uk_df)
uk_outliers


**Well done for completing the assignment! As well as being able to calculate statistics, review distributions and graphs, it is also important to develop an understanding of what information we can draw from all of this. Post a comment to the Knowledge Base about a conclusion that you have been able to draw from the data as part of your work above.**