# BASICS OF PYTHON | SESSION 3&4

---
Sina Shafiezadeh | October 2024
---


In these two sessions, we will work through a fake dataset of BMI scores in **10 steps** and **20 exercises**.

# 1.&nbsp;Data Importing

In [1]:
# import packages
import pandas as pd
import numpy as np

In [None]:
# import dataset (method 1): upload from the local machine
from google.colab import files

uploaded = files.upload()
filename = next(iter(uploaded)) # get the uploaded file name

data = pd.read_csv(filename)

In [None]:
# import dataset (method 2): load from Google Drive
from google.colab import drive
drive.mount('/content/drive')
data = pd.read_csv('/content/drive/MyDrive/BoP/bmi.csv') # absolute path, works from any working directory

In [None]:
# import dataset (method 3): download from a URL directly
import requests
# click on the "Raw" button to get the direct link to the raw file
url = "https://raw.githubusercontent.com/sina-shafiezadeh/Basics-of-Python-course/main/bmi.csv"
response = requests.get(url)
response.raise_for_status() # stop here if the download failed
with open('bmi.csv', 'wb') as f: # the with block closes the file after writing
    f.write(response.content)
data = pd.read_csv('bmi.csv')

# 2.&nbsp;Data Cleaning (overview)

In [None]:
print(data)

In [None]:
print(data.shape)

In [None]:
print(data.size) # rows*columns

In [None]:
print(data.head(5)) # first 5 rows

In [None]:
print(data.tail(5)) # last 5 rows

In [None]:
data.info() # prints index range, column names, non-null counts, data types and memory usage (no print() needed: info() returns None)

In [None]:
print(data.describe()) # description for numerical columns

In [None]:
# description for categorical columns
print(data['sex'].value_counts())
print("============================")
print(data['city'].value_counts())

# 3.&nbsp;Data Cleaning (interpretability)

As much as possible, we should reduce **complexity** and increase **consistency**.

## Exercise 1

---


Change the values of "sport" time from seconds to minutes.


*   Example: 2676 seconds = 44.6 minutes
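
As a minimal sketch of this kind of unit conversion, the cell below works on a small toy data frame instead of the BMI dataset. Only the column name "sport" comes from the exercise; the name `toy` and the values are made up for illustration.

```python
import pandas as pd

# Toy frame with invented values; only the column name comes from the exercise.
toy = pd.DataFrame({"sport": [2676, 1800, 90]})

# Dividing the whole column at once is vectorized, so no loop is needed.
toy["sport"] = toy["sport"] / 60

print(toy)
```

The result is assigned back to the same column, so the old values in seconds are overwritten.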


In [None]:
# replace the new value in the column






## Exercise 2

---


Change the values in "city" from string to integer.


*   Example: city2 = 2
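
One possible way to do this kind of conversion is sketched below on toy data. It assumes that every value follows the pattern "city" plus a number, as in the example above, and that the column has no missing values; the name `toy` and the values are invented.

```python
import pandas as pd

# Toy frame with invented values following the assumed "city<number>" pattern.
toy = pd.DataFrame({"city": ["city1", "city2", "city3", "city2"]})

# Strip the "city" prefix, then cast the remaining digits to integers.
toy["city"] = toy["city"].str.replace("city", "", regex=False).astype(int)

print(toy["city"].tolist())
```

If the column contained nan values, `astype(int)` would raise an error, so missing values would have to be handled first.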


In [None]:
# replace the new value in the column






## Exercise 3

---


We need neither too little nor too much precision. Drop the decimal place in the "age" values by rounding them to the nearest integer.


*   Example: 19.9 = 20
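
A short sketch of rounding a column, again on invented toy values rather than the real dataset:

```python
import pandas as pd

# Toy frame with invented ages.
toy = pd.DataFrame({"age": [19.9, 34.2, 51.7]})

# round() keeps the float dtype, so cast to int afterwards for whole-number ages.
toy["age"] = toy["age"].round().astype(int)

print(toy["age"].tolist())
```

Note that pandas rounds values lying exactly halfway between two integers to the nearest even integer.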


In [None]:
# replace the new value in the column







# 4.&nbsp;Data Cleaning (reduction)

Remove duplicate or unnecessary columns.

## Exercise 4

---


Remove the column that duplicates another column under a different name (bmi_score) and the column that is unrelated to the goal of our analysis (pet).
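
The sketch below shows how columns can be dropped from a toy data frame. It assumes that "bmi_score" duplicates the "bmi" column, which the exercise does not state explicitly; the values are invented.

```python
import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({
    "bmi": [22.1, 27.5],
    "bmi_score": [22.1, 27.5],  # assumed duplicate of "bmi" under another name
    "pet": ["cat", "dog"],      # unrelated to the analysis
})

# Confirm that the two columns really hold the same values before dropping one.
print(toy["bmi"].equals(toy["bmi_score"]))

# drop() returns a new data frame, so assign the result back.
toy = toy.drop(columns=["bmi_score", "pet"])
print(toy.columns.tolist())
```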

In [None]:
# we should have 6 columns after removing






# 5.&nbsp;Data Cleaning (noisy data)

Noisy data is any data we don't want, whether it is corrupted or simply meaningless for our analysis.

## Exercise 5

---


We don't need data that is not **timely**. Remove the outdated rows.

In [None]:
# Hint: check the "time" column







## Exercise 6

---


It is essential that the data be **believable**. Check for suspicious values and decide how to handle them.

In [None]:
# Hint: check the "sport" column








## Exercise 7

---


We don't need people under the age of 18 in this analysis. Be careful: data that is noisy for this analysis may not be noisy for another one.
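
Removing rows by a condition usually means keeping the rows that satisfy the opposite condition. A minimal sketch on invented ages:

```python
import pandas as pd

# Toy frame with invented ages.
toy = pd.DataFrame({"age": [16, 18, 25, 17, 40]})

# Keep only adults; reset_index renumbers the remaining rows from 0.
toy = toy[toy["age"] >= 18].reset_index(drop=True)

print(toy["age"].tolist())
```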

In [None]:
# remove the unnecessary rows







# 6.&nbsp;Data Cleaning (completeness)



We can **replace** nan values with another value, or **remove** them if we have enough data. In addition, depending on how we need to handle noisy data, we can choose from several techniques to deal with it.

## Exercise 8

---


Replace the nan values in the "bmi" column with the mean value. Be careful: manual replacement can take a lot of time and lead to human error.
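
Mean imputation can be sketched as follows on toy data. The values and the names `toy` and `bmi_mean` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with two missing BMI values.
toy = pd.DataFrame({"bmi": [20.0, np.nan, 30.0, np.nan]})

# Series.mean() skips nan by default, so this is the mean of the known values only.
bmi_mean = toy["bmi"].mean()

# fillna replaces every nan in the column with the given value.
toy["bmi"] = toy["bmi"].fillna(bmi_mean)

print(toy["bmi"].tolist())
```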

In [None]:
# first calculate the bmi mean over all non-nan values.







## Exercise 9

---


Remove the rows with nan values in the "sex" column.
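
A small sketch of counting and dropping missing values in one column. The labels "female" and "male" are assumptions about how the column is coded, and the values are invented.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values and two missing entries in "sex".
toy = pd.DataFrame({
    "sex": ["female", np.nan, "male", np.nan],
    "bmi": [21.0, 24.0, 26.0, 23.0],
})

# Count the missing values before removing anything.
n_missing = toy["sex"].isna().sum()
print(n_missing)

# subset restricts the check to "sex"; nan values in other columns are ignored.
toy = toy.dropna(subset=["sex"])
print(len(toy))
```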

In [None]:
# before removal, we should have 4 nan values in the sex column







# 7.&nbsp;Data Exploring



We can define different scenarios as we need.

## Exercise 10

---


Select rows where **city = 1**, **bmi > 25**, and **sport <= 60**.
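
Several conditions can be combined with `&`. The sketch below uses invented values and assumes that "city" already holds integers and "sport" already holds minutes, as after Exercises 1 and 2; the names `toy`, `mask` and `selected_toy` are hypothetical.

```python
import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({
    "city": [1, 1, 2, 1],
    "bmi": [27.0, 24.0, 30.0, 26.5],
    "sport": [45.0, 30.0, 20.0, 90.0],
})

# Each comparison needs its own parentheses because & binds tighter than == and >.
mask = (toy["city"] == 1) & (toy["bmi"] > 25) & (toy["sport"] <= 60)
selected_toy = toy[mask]

print(selected_toy)
```

Only the first row passes all three conditions in this toy example.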

In [None]:
# store results in the new data frame with the name "selected_data"








## Exercise 11

---


Sort **females aged 18 to 30** by **sport** in descending order.
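
Filtering and sorting can be chained in two steps, as in this sketch on invented values. The label "female" is an assumption about how the "sex" column is coded.

```python
import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({
    "sex": ["female", "male", "female", "female"],
    "age": [22, 25, 29, 41],
    "sport": [30.0, 80.0, 55.0, 70.0],
})

# between() is inclusive on both ends by default.
young_women = toy[(toy["sex"] == "female") & (toy["age"].between(18, 30))]

# ascending=False puts the largest "sport" value first.
young_women = young_women.sort_values(by="sport", ascending=False)

print(young_women)
```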

In [None]:
# Hint: check the "sort_values" method







## Exercise 12

---


Which city has the **highest** and which city has the **lowest** average BMI score?
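
Group-wise averages can be computed with `groupby`. The following sketch uses invented values; `mean_bmi` is a hypothetical name.

```python
import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({
    "city": [1, 1, 2, 2, 3, 3],
    "bmi": [22.0, 24.0, 27.0, 29.0, 20.0, 21.0],
})

# One mean per city; the result is a Series indexed by city.
mean_bmi = toy.groupby("city")["bmi"].mean()
print(mean_bmi)

# idxmax/idxmin return the index label (here the city) of the extreme value.
print(mean_bmi.idxmax(), mean_bmi.idxmin())
```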

In [None]:
# Hint: compute the mean of the values in "bmi", grouped by the values in "city"








# 8.&nbsp;Data Analysis

## Exercise 13

---


Which attributes are **correlated**?
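
A minimal sketch of a correlation matrix on invented values is shown below. It assumes pandas 1.5 or newer, where `corr()` accepts the `numeric_only` argument.

```python
import pandas as pd

# Toy frame with invented values; "bmi" is built to rise linearly with "age".
toy = pd.DataFrame({
    "age": [20, 30, 40, 50],
    "bmi": [21.0, 23.0, 25.0, 27.0],
    "sport": [60.0, 50.0, 45.0, 20.0],
    "sex": ["female", "male", "female", "male"],
})

# numeric_only=True skips text columns such as "sex"; corr() uses Pearson by default.
corr_matrix = toy.corr(numeric_only=True)

print(corr_matrix)
```

Values close to 1 or -1 indicate a strong linear relationship, and values close to 0 indicate a weak one.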

In [None]:
# Hint: check "corr()" function







## Exercise 14

---


Is the difference in BMI scores between **men and women** statistically **significant**?

First, define a function that runs the statistical test, and then use it to answer the question.

Note: you can access the API for suitable statistical tests [HERE](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.t.html).
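
As a rough sketch of how a two-sample Student's t-test can be built on top of `scipy.stats.t`, the function below computes the pooled-variance t statistic and a two-sided p-value. It assumes equal variances in both groups and leaves out the variance ratio check. The function name `two_sample_t_test` and the group values are hypothetical and not taken from the dataset.

```python
import numpy as np
from scipy import stats


def two_sample_t_test(a, b):
    """Two-sided Student's t-test for two independent samples, assuming equal variances."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n_a, n_b = len(a), len(b)

    # Pooled variance: sample variances (ddof=1) weighted by their degrees of freedom.
    dof = n_a + n_b - 2
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / dof

    # Difference of the means divided by its standard error.
    t_stat = (a.mean() - b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

    # Two-sided p-value from the survival function of the t distribution.
    p_value = 2 * stats.t.sf(abs(t_stat), dof)
    return t_stat, p_value


# Invented BMI values for two groups, for illustration only.
group_a = [22.1, 24.3, 23.8, 25.0, 21.9]
group_b = [26.4, 27.1, 25.9, 28.3, 26.8]

t_stat, p_value = two_sample_t_test(group_a, group_b)
print(t_stat, p_value)
```

The result can be cross-checked against `scipy.stats.ttest_ind`, which uses the same equal-variance assumption by default.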

In [None]:
# Hint 1: import the "scipy" library and implement Student's t-test with "scipy.stats.t"
# Hint 2: if the ratio of the larger variance to the smaller one is below 4:1, treat the variances as equal.







## Exercise 15

---


Is the difference in BMI scores between people aged **18 to 25** and people aged **45 to 55** statistically **significant**?

*   Expected output:

        pvalue = 5.900678573659376e-26

In [None]:
# use the previous function







## Exercise 16

---

Is the difference in BMI scores between cities statistically **significant**?

First, define a **new function** that runs the statistical test, and then use it to answer the question.

Note: you can access the API for suitable statistical tests [HERE](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html).

*   Expected output:

        pvalue = 0.1543449952170188
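
The call pattern of `scipy.stats.f_oneway` can be sketched on toy data as follows. The values are invented, so the resulting p-value is unrelated to the expected output above.

```python
import pandas as pd
from scipy import stats

# Toy frame with invented BMI values for three cities.
toy = pd.DataFrame({
    "city": [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "bmi": [22.0, 24.5, 23.1, 25.2, 23.4, 22.8, 24.9, 24.0, 22.7, 25.1, 23.6, 24.3],
})

# One array of BMI values per city; f_oneway accepts any number of groups.
groups = [group["bmi"].to_numpy() for _, group in toy.groupby("city")]

# The result carries the F statistic and the p-value.
result = stats.f_oneway(*groups)
print(result.statistic, result.pvalue)
```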

In [None]:
# Hint: implement the one-way ANOVA test by "scipy.stats.f_oneway"









# 9.&nbsp;Data Visualization

## Exercise 17

---


Plot a **scatter plot** for "age" and "bmi" using **matplotlib** [API](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html). You can see some practical examples [HERE](https://www.geeksforgeeks.org/matplotlib-pyplot-scatter-in-python/).

In [None]:
# first import matplotlib










## Exercise 18

---


Plot a **bar chart** for average "sport" per "city" using matplotlib [API](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.bar.html). You can see some practical examples [HERE](https://www.geeksforgeeks.org/bar-plot-in-matplotlib/).

In [None]:
# try to follow the standards of a scientific figure (title, scale,...)









## Exercise 19

---


Plot a **box plot** for "bmi" per "city" to compare results (3 box plots in one figure) using matplotlib [API](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.boxplot.html). You can see some practical examples [HERE](https://www.geeksforgeeks.org/box-plot-in-python-using-matplotlib/).
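
One possible layout for several box plots in a single figure is sketched below on invented values. The figure size, font sizes and label texts are arbitrary choices, not requirements of the exercise.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame with invented values; only the column names come from the exercise.
toy = pd.DataFrame({
    "city": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "bmi": [21.0, 23.5, 26.0, 24.0, 27.5, 30.0, 19.5, 22.0, 24.5],
})

# Build one array of BMI values per city, in a fixed city order.
cities = sorted(toy["city"].unique())
groups = [toy.loc[toy["city"] == c, "bmi"].to_numpy() for c in cities]

fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(groups)  # boxes are drawn at x positions 1, 2, 3, ...

# Label each box with its city and describe both axes.
ax.set_xticks(list(range(1, len(cities) + 1)))
ax.set_xticklabels([f"City {c}" for c in cities], fontsize=11)
ax.set_xlabel("City", fontsize=12)
ax.set_ylabel("BMI", fontsize=12)
ax.set_title("BMI per city (toy data)", fontsize=13)
```

In a notebook, the figure is displayed when the cell finishes running.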

In [None]:
# try to make the colors and fonts legible








# 10.&nbsp;Data Exporting

## Exercise 20

---


Save the **box plot** image in **SVG** format using matplotlib [API](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html). You can see a practical example [HERE](https://www.geeksforgeeks.org/how-to-save-a-plot-to-a-file-using-matplotlib/). Next, write the cleaned data in **CSV** format using pandas [API](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_csv.html). You can see some practical examples [HERE](https://www.geeksforgeeks.org/saving-a-pandas-dataframe-as-a-csv/).
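
The two export calls can be sketched as follows. The example uses a simple bar chart on invented values instead of the box plot, and the file names `toy_bmi_per_city.svg` and `toy_bmi_clean.csv` are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy frame with invented values.
toy = pd.DataFrame({"city": [1, 2, 3], "bmi": [23.5, 26.1, 22.8]})

fig, ax = plt.subplots()
ax.bar(toy["city"], toy["bmi"])
ax.set_xlabel("City")
ax.set_ylabel("BMI")

# The file extension tells savefig which format to write.
fig.savefig("toy_bmi_per_city.svg", bbox_inches="tight")

# index=False leaves the row index out of the CSV file.
toy.to_csv("toy_bmi_clean.csv", index=False)
```

Both files are written to the current working directory.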

In [None]:
# choose meaningful file names.









Congratulations! You've finished this course successfully.

# References and Resources:

1. You can continue your learning through more complex projects on [Kaggle](https://www.kaggle.com/).

2. You can also access neuroscience datasets on [CRCNS](https://crcns.org/data-sets).