## Data Wrangling I
Perform the following operations using Python on any open source dataset (e.g., data.csv):
1. Import all the required Python libraries.
2. Locate an open source dataset on the web (e.g., https://www.kaggle.com). Provide a clear
 description of the data and its source (i.e., the URL of the website).
3. Load the dataset into a pandas DataFrame.
4. Data Preprocessing: check for missing values in the data using pandas isnull(), and use the
describe() function to get some initial statistics. Provide variable descriptions, types of variables, etc.
Check the dimensions of the DataFrame.
5. Data Formatting and Data Normalization: summarize the types of variables by checking
the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the
data set. If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.

In [2]:
import numpy as np
import pandas as pd

In [3]:
# Load the Dataset into pandas dataframe.
df = pd.read_csv('airquality.csv')
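
The `Unnamed: 0` column that appears in the output below is typically a row index that was written into the CSV file when it was saved. A minimal sketch, assuming that is the case here, of reading such a column as the index with `index_col=0`. The inline CSV text is a hypothetical stand-in for `airquality.csv`, built from the first rows shown in this notebook:

```python
import io

import pandas as pd

# Hypothetical stand-in for airquality.csv (first rows only), so the sketch runs on its own.
csv_text = """,Ozone,Solar.R,Wind,Temp,Month,Day,humidity
1,41.0,190.0,7.4,67,5,1,high
2,36.0,118.0,8.0,72,5,2,high
3,12.0,149.0,12.6,74,5,3,high
4,18.0,313.0,11.5,62,5,4,high
5,,,14.3,56,5,5,high
"""

# index_col=0 uses the first (unnamed) column as the row index
# instead of loading it as a data column called "Unnamed: 0".
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(sample.shape)          # rows x columns, without the extra index column
print(list(sample.columns))
```

Whether to keep the column as data or as the index is a judgment call; the rest of this notebook keeps it as a regular column.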

In [4]:
df

,Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day,humidity
0,1,41.0,190.0,7.4,67,5,1,high
1,2,36.0,118.0,8.0,72,5,2,high
2,3,12.0,149.0,12.6,74,5,3,high
3,4,18.0,313.0,11.5,62,5,4,high
4,5,,,14.3,56,5,5,high
...,...,...,...,...,...,...,...,...
148,149,30.0,193.0,6.9,70,9,26,high
149,150,,145.0,13.2,77,9,27,high
150,151,14.0,191.0,14.3,75,9,28,high
151,152,18.0,131.0,8.0,76,9,29,high


In [5]:
# Check for missing values in the data using pandas isnull().
df.isnull()

,Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day,humidity
0,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False
4,False,True,True,False,False,False,False,False
...,...,...,...,...,...,...,...,...
148,False,False,False,False,False,False,False,False
149,False,True,False,False,False,False,False,False
150,False,False,False,False,False,False,False,False
151,False,False,False,False,False,False,False,False


In [6]:
df.isnull().sum()

Unnamed: 0     0
Ozone         37
Solar.R        7
Wind           0
Temp           0
Month          0
Day            0
humidity       4
dtype: int64
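
Raw counts are easier to judge relative to the number of rows. A minimal sketch of reporting the share of missing values per column and listing the affected rows; the small frame is a hypothetical stand-in for `df`, with values taken from the rows shown earlier:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df (five rows from the display above).
sample = pd.DataFrame({
    "Ozone": [41.0, 36.0, 12.0, 18.0, np.nan],
    "Solar.R": [190.0, 118.0, 149.0, 313.0, np.nan],
    "Wind": [7.4, 8.0, 12.6, 11.5, 14.3],
})

# isnull() yields booleans; their mean is the fraction of missing entries.
missing_pct = sample.isnull().mean() * 100

# Rows that contain at least one missing value.
rows_with_missing = sample[sample.isnull().any(axis=1)]

print(missing_pct)
print(rows_with_missing)
```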

In [7]:
df.describe()

,Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day
count,153.0,116.0,146.0,153.0,153.0,153.0,153.0
mean,77.0,42.12931,185.931507,9.957516,77.882353,6.993464,15.803922
std,44.311398,32.987885,90.058422,3.523001,9.46527,1.416522,8.86452
min,1.0,1.0,7.0,1.7,56.0,5.0,1.0
25%,39.0,18.0,115.75,7.4,72.0,6.0,8.0
50%,77.0,31.5,205.0,9.7,79.0,7.0,16.0
75%,115.0,63.25,258.75,11.5,85.0,8.0,23.0
max,153.0,168.0,334.0,20.7,97.0,9.0,31.0
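
`describe()` summarizes only numeric columns by default, which is why `humidity` is absent from the table above. A short sketch, on a hypothetical stand-in for `df`, of including non-numeric columns with `include="all"`:

```python
import pandas as pd

# Hypothetical stand-in for df with one numeric and one text column.
sample = pd.DataFrame({
    "Temp": [67, 72, 74, 62, 56],
    "humidity": ["high", "high", "high", "high", "high"],
})

# Default: numeric columns only.
numeric_summary = sample.describe()

# include="all" adds count, unique, top and freq for non-numeric columns;
# statistics that do not apply to a column are reported as NaN.
full_summary = sample.describe(include="all")

print(numeric_summary)
print(full_summary)
```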


In [8]:
# Check the dimensions of the data frame.
df.shape

(153, 8)

In [9]:
# Provide variable descriptions: inspect the data type of each variable.
df.dtypes

Unnamed: 0      int64
Ozone         float64
Solar.R       float64
Wind          float64
Temp            int64
Month           int64
Day             int64
humidity       object
dtype: object
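
To summarize the variable types as step 5 asks, the columns can be grouped by dtype. A minimal sketch on a hypothetical stand-in for `df`:

```python
import pandas as pd

# Hypothetical stand-in for df with float, integer and text columns.
sample = pd.DataFrame({
    "Ozone": [41.0, 36.0],
    "Temp": [67, 72],
    "humidity": ["high", "high"],
})

# Number of columns per dtype.
dtype_counts = sample.dtypes.value_counts()

# Column names split into numeric and non-numeric groups.
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(dtype_counts)
print(numeric_cols, other_cols)
```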

In [10]:
# Turn categorical variables into quantitative variables in Python.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

In [11]:
df['humidity'] = le.fit_transform(df['humidity'])

In [12]:
df

,Unnamed: 0,Ozone,Solar.R,Wind,Temp,Month,Day,humidity
0,1,41.0,190.0,7.4,67,5,1,0
1,2,36.0,118.0,8.0,72,5,2,0
2,3,12.0,149.0,12.6,74,5,3,0
3,4,18.0,313.0,11.5,62,5,4,0
4,5,,,14.3,56,5,5,0
...,...,...,...,...,...,...,...,...
148,149,30.0,193.0,6.9,70,9,26,0
149,150,,145.0,13.2,77,9,27,0
150,151,14.0,191.0,14.3,75,9,28,0
151,152,18.0,131.0,8.0,76,9,29,0


In [13]:
df['humidity'].unique()

array([0, 1, 2, 3])
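
`df.isnull().sum()` reported 4 missing `humidity` values before encoding, so one of the four codes above may stand for the missing entries rather than a real category; how `LabelEncoder` treats them depends on the scikit-learn version, and that reading is an assumption here. As an alternative to `LabelEncoder` (not what this notebook uses), pandas category codes keep missing values apart by assigning them -1. The category names other than `high` are hypothetical:

```python
import numpy as np
import pandas as pd

# "high" appears in the notebook output; "low" and "medium" are hypothetical.
humidity = pd.Series(["high", "low", np.nan, "medium", "high"])

# Convert to a categorical; categories are sorted alphabetically.
as_category = humidity.astype("category")

# Integer code per row; missing values get the code -1.
codes = as_category.cat.codes

# Lookup table from code back to the original label.
code_to_label = dict(enumerate(as_category.cat.categories))

print(codes.tolist())
print(code_to_label)
```

Keeping the lookup table makes the encoded column interpretable later, which the bare integer codes are not.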

In [15]:
df.dtypes

Unnamed: 0      int64
Ozone         float64
Solar.R       float64
Wind          float64
Temp            int64
Month           int64
Day             int64
humidity        int32
dtype: object

In [19]:
# If variables are not in the correct data type, apply proper type conversions.
df['Unnamed: 0'] = df['Unnamed: 0'].astype('float')
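
`astype` also accepts a dictionary, so several columns can be converted in one call. A minimal sketch on a hypothetical stand-in for `df`; treating `Month` as a category and `Temp` as a float are illustrative choices, not conversions this notebook performs:

```python
import pandas as pd

# Hypothetical stand-in for df with two integer columns.
sample = pd.DataFrame({
    "Temp": [67, 72, 74],
    "Month": [5, 5, 9],
})

# Map column name -> target dtype; columns not listed keep their dtype.
converted = sample.astype({"Temp": "float64", "Month": "category"})

print(converted.dtypes)
```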

In [20]:
df.dtypes

Unnamed: 0    float64
Ozone         float64
Solar.R       float64
Wind          float64
Temp            int64
Month           int64
Day             int64
humidity        int32
dtype: object
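
Step 5 also mentions data normalization, which the cells above do not cover. One common choice is min-max scaling, which maps each numeric column to the range [0, 1]. A minimal sketch on a hypothetical stand-in for `df`; the choice of columns and of min-max scaling is an assumption, not something taken from this notebook:

```python
import pandas as pd

# Hypothetical stand-in for df (values from the first rows shown earlier).
sample = pd.DataFrame({
    "Wind": [7.4, 8.0, 12.6, 11.5, 14.3],
    "Temp": [67, 72, 74, 62, 56],
})

# (x - min) / (max - min), computed column by column.
# Missing values, if present, are skipped by min()/max() and stay NaN.
normalized = (sample - sample.min()) / (sample.max() - sample.min())

print(normalized)
```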