### Task Structured Tabular Data:

#### Dataset Link:
Dataset can be found at " /data/structured_data/data.csv " in the respective challenge's repo.

#### Description:
Tabular data is usually given in csv format (comma-separated-value). CSV files can be read and manipulated using pandas and numpy library in python. Most common datatypes in structured data are 'numerical' and 'categorical' data. Data processing is required to handle missing values, inconsistent string formats, missing commas, categorical variables and other different kinds of data inadequacies that you will get to experience in this course. 

#### Objective:
How to process and manipulate basic structured data for machine learning (Check out helpful links section to get hints)

#### Tasks:
- Load the csv file (pandas.read_csv function)
- Classify columns into two groups - numerical and categorical. Print column names for each group.
- Print first 10 rows after handling missing values
- One-Hot encode the categorical data
- Standarize or normalize the numerical columns

#### Ask yourself:

- Why do we need feature encoding and scaling techniques?
- What is ordinal data and should we one-hot encode ordinal data? Are any better ways to encode it?
- What's the difference between normalization and standardization? Which technique is most suitable for this sample dataset?
- Can you solve the level-up challenge: Complete all the above tasks without using scikit-learn library ?

#### Helpful Links:
- Nice introduction to handle missing values: https://analyticsindiamag.com/5-ways-handle-missing-values-machine-learning-datasets/
- Scikit-learn documentation for one hot encoding: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
- Difference between normalization and standardization: https://medium.com/towards-artificial-intelligence/how-when-and-why-should-you-normalize-standardize-rescale-your-data-3f083def38ff

**Task1**

In [177]:
import sys
import numpy as np
import pandas as pd

#loading the data file
file = pd.read_csv("challenge-week-1/data/structured_data/data.csv")
num = []
cat = []
#Segregating Numerical Columns
print('Name and Group:')
for _,i in file.items():
    if i.dtype != object:
        num.append(i.name)
    else:
        cat.append(i.name)

sep = pd.DataFrame({'Numerical': pd.Series(num), 'Categorical': pd.Series(cat)})
sep

Name and Group:


Unnamed: 0,Numerical,Categorical
0,Age,Country
1,Salary,Purchased
2,Price Category Of Purchase,


In [178]:
#First 10 rows after handling missing value (deleting the rows which contain missing value)
file.dropna(inplace = True)
file.head(10)

Unnamed: 0,Country,Age,Salary,Purchased,Price Category Of Purchase
0,France,44.0,72000.0,No,1
1,Spain,27.0,48000.0,Yes,1
2,Germany,30.0,54000.0,No,2
3,Spain,38.0,61000.0,No,3
5,France,35.0,58000.0,Yes,2
7,France,48.0,79000.0,Yes,1
8,Germany,50.0,83000.0,No,2
9,France,37.0,67000.0,Yes,2
10,France,18.0,54400.0,No,3
11,Germany,22.0,55000.0,Yes,3


In [179]:
#One hot encoding categorical data
from sklearn.preprocessing import LabelEncoder

#LabelEncoder object
lenc = LabelEncoder()

#Iterating through categorical data and replacing in original file after transforming
for category in cat:
    file[category] = lenc.fit_transform(file[category])

file

Unnamed: 0,Country,Age,Salary,Purchased,Price Category Of Purchase
0,0,44.0,72000.0,0,1
1,2,27.0,48000.0,1,1
2,1,30.0,54000.0,0,2
3,2,38.0,61000.0,0,3
5,0,35.0,58000.0,1,2
7,0,48.0,79000.0,1,1
8,1,50.0,83000.0,0,2
9,0,37.0,67000.0,1,2
10,0,18.0,54400.0,0,3
11,1,22.0,55000.0,1,3


In [180]:
dummies = pd.get_dummies(file["Country"])
file = pd.concat([file, dummies], axis=1)
file

Unnamed: 0,Country,Age,Salary,Purchased,Price Category Of Purchase,0,1,2
0,0,44.0,72000.0,0,1,1,0,0
1,2,27.0,48000.0,1,1,0,0,1
2,1,30.0,54000.0,0,2,0,1,0
3,2,38.0,61000.0,0,3,0,0,1
5,0,35.0,58000.0,1,2,1,0,0
7,0,48.0,79000.0,1,1,1,0,0
8,1,50.0,83000.0,0,2,0,1,0
9,0,37.0,67000.0,1,2,1,0,0
10,0,18.0,54400.0,0,3,1,0,0
11,1,22.0,55000.0,1,3,0,1,0


In [181]:
file.drop("Country", axis=1, inplace=True)
file

Unnamed: 0,Age,Salary,Purchased,Price Category Of Purchase,0,1,2
0,44.0,72000.0,0,1,1,0,0
1,27.0,48000.0,1,1,0,0,1
2,30.0,54000.0,0,2,0,1,0
3,38.0,61000.0,0,3,0,0,1
5,35.0,58000.0,1,2,1,0,0
7,48.0,79000.0,1,1,1,0,0
8,50.0,83000.0,0,2,0,1,0
9,37.0,67000.0,1,2,1,0,0
10,18.0,54400.0,0,3,1,0,0
11,22.0,55000.0,1,3,0,1,0


In [183]:
#Standardizing numerical columns
from sklearn import preprocessing
toScale = file[["Salary", "Age"]].values
scaler = preprocessing.MinMaxScaler()
scaled = scaler.fit_transform(toScale)
dfScaled = pd.DataFrame(scaled)
file["Salary"] = dfScaled[0]
file["Age"] = dfScaled[1]
file

[[0.73809524 0.8125    ]
 [0.16666667 0.28125   ]
 [0.30952381 0.375     ]
 [0.47619048 0.625     ]
 [0.9047619  0.9375    ]
 [0.61904762 0.59375   ]
 [0.31904762 0.        ]
 [0.33333333 0.125     ]
 [0.02380952 0.3125    ]
 [0.         0.1875    ]
 [0.61904762 0.4375    ]
 [0.57142857 0.625     ]
 [       nan        nan]
 [       nan        nan]
 [       nan        nan]]


Unnamed: 0,Age,Salary,Purchased,Price Category Of Purchase,0,1,2
0,0.866667,0.815789,0,1,1,0,0
1,0.3,0.184211,1,1,0,0,1
2,0.4,0.342105,0,2,0,1,0
3,0.666667,0.526316,0,3,0,0,1
5,0.633333,0.684211,1,2,1,0,0
7,0.133333,0.368421,1,1,1,0,0
8,0.333333,0.026316,0,2,0,1,0
9,0.2,0.0,1,2,1,0,0
10,0.466667,0.684211,0,3,1,0,0
11,0.666667,0.631579,1,3,0,1,0


**Task2**