# Module 1: Exercise

<div class="alert alert-block alert-info">
Make sure you have downloaded all the data sets listed below from the Module Introduction page and saved them in the same folder as this notebook file.
<ul>
    <li>diabetes.csv</li>
    <li>diabetes_headless.csv</li>
    <li>diabetes_nrow.csv</li>
    <li>diabetes_semicol.csv</li>
</ul>
</div>

>__Task 1__
>
>Run the following cells to check your Python version and present working directory.

In [4]:
# Check Python version
!python -V

Python 3.12.4


In [3]:
# Check present directory
%pwd

'/Users/Tarun/Desktop/Projects/uwaterloo-python4ml/Module1'
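
The same checks can be sketched in plain Python, which also makes it easy to confirm that the data files are where the later cells expect them. The `data_dir` location is an assumption taken from the `./rawfiles/` paths used in the import cells below; adjust it if your files live elsewhere.

```python
import sys
from pathlib import Path

# Interpreter version as a tuple of (major, minor, micro)
version = tuple(sys.version_info[:3])
print("Python version:", ".".join(map(str, version)))

# Current working directory; relative paths resolve from here
cwd = Path.cwd()
print("Working directory:", cwd)

# Assumption: the data sets sit in a 'rawfiles' subfolder of the working directory
data_dir = cwd / "rawfiles"
expected = [
    "diabetes.csv",
    "diabetes_headless.csv",
    "diabetes_nrow.csv",
    "diabetes_semicol.csv",
]

# Collect the names of any expected files that are not present
missing = [name for name in expected if not (data_dir / name).is_file()]
print("Missing files:", missing if missing else "none")
```

If any file names are reported as missing, download them again or correct `data_dir` before running the import cells.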

## Load Python Packages

>__Task 2__
>
>In the cell below, import pandas and NumPy using their conventional aliases (`pd` and `np`)

In [2]:
import pandas as pd
import numpy as np

## Read the Data Set

>__Task 3__
>
>Import and explore the data file "diabetes.csv"

In [9]:
# Import the data set
diabetes = pd.read_csv('./rawfiles/diabetes.csv')

# View the first 6 rows
diabetes.head(6)

# When running the cell, you should see the output table below.

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0


We can choose how many rows to display by passing a number _n_ to `.head()`; with no argument, it returns the first 5 rows. For example, let's have a look at the first 10 rows:

In [10]:
diabetes.head(10)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
5,5,116,74,0,0,25.6,0.201,30,0
6,3,78,50,32,88,31.0,0.248,26,1
7,10,115,0,0,0,35.3,0.134,29,0
8,2,197,70,45,543,30.5,0.158,53,1
9,8,125,96,0,0,0.0,0.232,54,1
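
As a minimal sketch of how the row count passed to `.head()` behaves, the example below uses a small made-up DataFrame (not the diabetes data) and also shows `.tail()`, which works the same way from the end of the table.

```python
import pandas as pd

# Small illustrative frame; the values are made up for this example
df = pd.DataFrame({"x": range(10), "y": [v * v for v in range(10)]})

first_default = df.head()   # no argument: first 5 rows
first_three = df.head(3)    # first 3 rows
last_two = df.tail(2)       # last 2 rows

print(first_default.shape)       # (5, 2)
print(first_three.shape)         # (3, 2)
print(last_two.index.tolist())   # [8, 9]
```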


### Parameter: `sep`

Now, let's try to read a semicolon-separated data set and see how to fix the format issue.

>__Task 4__
>
>Import and explore the data file "diabetes_semicol.csv"

In [16]:
# Import the data set
diabetes_scol = pd.read_csv('./rawfiles/diabetes_semicol.csv', sep=';')

# View the first 3 rows
diabetes_scol.head(3)

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1


Without a separator argument, the data is not read properly: `pd.read_csv()` expects comma-separated values but finds no commas, so each row is read into a single column. To solve this problem, we need to specify the separator using `sep=';'`.

>__Task 5__
>
>Repeat the last task but add `sep=';'` within `.read_csv()` this time

In [17]:
diabetes_scol = pd.read_csv('./rawfiles/diabetes_semicol.csv', sep=';')
diabetes_scol.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
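
The effect of `sep` can be reproduced without any files. The sketch below reads a short made-up semicolon-separated string through `io.StringIO`, first with the default separator and then with `sep=';'`; the column names and values are illustrative only.

```python
import io
import pandas as pd

# Made-up semicolon-separated text standing in for a file on disk
raw = "Glucose;BMI;Outcome\n148;33.6;1\n85;26.6;0\n"

# Default separator is a comma, so each line lands in a single column
wrong = pd.read_csv(io.StringIO(raw))
print(wrong.shape)              # (2, 1)
print(wrong.columns.tolist())   # ['Glucose;BMI;Outcome']

# Telling pandas about the semicolon splits the fields correctly
right = pd.read_csv(io.StringIO(raw), sep=";")
print(right.shape)              # (2, 3)
print(right.columns.tolist())   # ['Glucose', 'BMI', 'Outcome']
```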


### Parameter: `skiprows`

Sometimes, our data set may contain information that we don't want to include for analysis. For example, let's have a look at the "diabetes_nrow.csv" data.

>__Task 6__
>
>Import and explore the data file "diabetes_nrow.csv"

In [19]:
# Import the data set
diabetes_nrow = pd.read_csv('./rawfiles/diabetes_nrow.csv')

# View the first 5 rows
diabetes_nrow.head(5)

Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Below you can find diabets statistics of 9 patients:
Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0


The file is not read properly because it starts with a description line. Note also that the second line of the file is blank, which does not show in the output above. We can skip the first two lines using `skiprows=2`.

>__Task 7__
>
>Repeat the last task but add `skiprows=2` within `.read_csv()` this time

In [20]:
diabetes_nrow = pd.read_csv('./rawfiles/diabetes_nrow.csv', skiprows=2)
diabetes_nrow.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
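
A minimal, file-free sketch of `skiprows` is shown below. The text mimics the layout described above (a description line, a blank line, then the header), but its contents are made up for illustration.

```python
import io
import pandas as pd

# Made-up text: description line, blank line, then the real header and data
raw = (
    "Patient statistics (description line)\n"
    "\n"
    "Glucose,BMI,Outcome\n"
    "148,33.6,1\n"
    "85,26.6,0\n"
)

# Skip the first two lines so that the third line is used as the header
clean = pd.read_csv(io.StringIO(raw), skiprows=2)
print(clean.columns.tolist())   # ['Glucose', 'BMI', 'Outcome']
print(clean.shape)              # (2, 3)
```

`skiprows` also accepts a list of 0-indexed line numbers, such as `skiprows=[0, 1]`, which is handy when the unwanted lines are not all at the top of the file.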


### Parameter: `header`

When the data set doesn't have a header row, `pd.read_csv()` still treats the first row of data as the column names by default. For example, let's look at the "diabetes_headless.csv" data.

>__Task 8__
>
>Import and explore the data file "diabetes_headless.csv"

In [22]:
# Import the data set
diabetes_head = pd.read_csv('./rawfiles/diabetes_headless.csv')

# View the first 5 rows
diabetes_head.head(5)

Unnamed: 0,6,148,72,35,0,33.6,0.627,50,1
0,1,85,66,29,0,26.6,0.351,31,0
1,8,183,64,0,0,23.3,0.672,32,1
2,1,89,66,23,94,28.1,0.167,21,0
3,0,137,40,35,168,43.1,2.288,33,1
4,5,116,74,0,0,25.6,0.201,30,0


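To avoid this import mistake, we add `header=None`, which tells pandas that the file has no header row. The columns are then labelled with integers starting at 0, and we can assign a header separately.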

>__Task 9__
>
>Repeat the last task but add `header=None` within `.read_csv()` this time

In [23]:
diabetes_head = pd.read_csv('./rawfiles/diabetes_headless.csv', header=None)
diabetes_head.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
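
The difference between the default behaviour and `header=None` can be sketched with a few made-up headerless lines. Note how the default read loses one data row to the header.

```python
import io
import pandas as pd

# Made-up headerless data: every line is a record
raw = "148,33.6,1\n85,26.6,0\n183,23.3,1\n"

# Default: the first record is consumed as column names
default = pd.read_csv(io.StringIO(raw))
print(default.columns.tolist())    # ['148', '33.6', '1']
print(len(default))                # 2

# header=None: all lines are data, columns get integer labels
no_header = pd.read_csv(io.StringIO(raw), header=None)
print(no_header.columns.tolist())  # [0, 1, 2]
print(len(no_header))              # 3
```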


To make the columns more meaningful, we can replace the integer labels with proper names using `DataFrame.columns`:

In [24]:
diabetes_head.columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
diabetes_head.head()

Unnamed: 0,Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
0,6,148,72,35,0,33.6,0.627,50,1
1,1,85,66,29,0,26.6,0.351,31,0
2,8,183,64,0,0,23.3,0.672,32,1
3,1,89,66,23,94,28.1,0.167,21,0
4,0,137,40,35,168,43.1,2.288,33,1
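
As an alternative sketch, the column names can be supplied at read time through the `names` parameter of `pd.read_csv()`, which avoids the separate assignment step. The data and the shortened column list below are made up for illustration.

```python
import io
import pandas as pd

# Made-up headerless data and an illustrative list of column names
raw = "148,33.6,1\n85,26.6,0\n183,23.3,1\n"
cols = ["Glucose", "BMI", "Outcome"]

# Option 1: read without a header, then assign the names afterwards
assigned = pd.read_csv(io.StringIO(raw), header=None)
assigned.columns = cols

# Option 2: pass the names directly while reading
named = pd.read_csv(io.StringIO(raw), header=None, names=cols)

print(named.columns.tolist())   # ['Glucose', 'BMI', 'Outcome']
print(len(named))               # 3
```

Both options give the same column labels; passing `names` simply keeps the import and the naming in one call.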
