# FEATURE ENGINEERING

You are a data scientist at an insurance company and are working with a data set of customer records. This dataset is originally from Kaggle and has a lot of potential for various machine learning purposes. You are tasked with transforming some of these features to make the data more useful for analysis. To do this, you will practice the following:
1. Transforming categorical data
2. Scaling your data
3. Working with date-time features

# Let’s start with some basic exploring by performing the following:


First, import your dataset. It is stored in a file named insurance.csv. Save it to a variable called df.

In [50]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler


In [51]:
df = pd.read_csv('insurance.csv')
df.head()

   age     sex     bmi  children smoker     region      charges
0   19  female  27.900         0    yes  southwest  16884.92400
1   18    male  33.770         1     no  southeast   1725.55230
2   28    male  33.000         3     no  southeast   4449.46200
3   33    male  22.705         0     no  northwest  21984.47061
4   32    male  28.880         0     no  northwest   3866.85520


In [52]:

print(df.columns)


Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')


In [53]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1338 entries, 0 to 1337
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   age       1338 non-null   int64  
 1   sex       1338 non-null   object 
 2   bmi       1338 non-null   float64
 3   children  1338 non-null   int64  
 4   smoker    1338 non-null   object 
 5   region    1338 non-null   object 
 6   charges   1338 non-null   float64
dtypes: float64(2), int64(2), object(3)
memory usage: 73.3+ KB
None
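The dtype column above is what tells us which features need encoding. As a minimal sketch, assuming a small hand-built sample in place of insurance.csv (the `sample` frame below is hypothetical), the non-numeric columns can be listed directly:

```python
import pandas as pd

# Hypothetical three-row sample shaped like the first rows of insurance.csv.
sample = pd.DataFrame({
    "age": [19, 18, 28],
    "sex": ["female", "male", "male"],
    "bmi": [27.9, 33.77, 33.0],
    "smoker": ["yes", "no", "no"],
    "region": ["southwest", "southeast", "southeast"],
})

# Columns that are not numeric hold text and need encoding before modeling.
text_columns = sample.select_dtypes(exclude="number").columns.tolist()
print(text_columns)

# A quick check that no column has missing values.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

On the real data frame the same call should point at sex, smoker, and region, the three columns that info() reports as object.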


# 1. Transform the smoker feature in our dataset

Start by printing the feature’s .value_counts().

In [54]:
var = df['smoker'].value_counts()
print(var)

no     1064
yes     274
Name: smoker, dtype: int64


#### Using binary_dict, map the smoker column to a new binary column, smoker_val
Print the results using .value_counts() to confirm the transformation.

In [55]:
binary_dict = {'no':0,'yes':1}
df['smoker_val'] = df['smoker'].map(binary_dict)


In [56]:
print(df.head())

   age     sex     bmi  children smoker     region      charges  smoker_val
0   19  female  27.900         0    yes  southwest  16884.92400           1
1   18    male  33.770         1     no  southeast   1725.55230           0
2   28    male  33.000         3     no  southeast   4449.46200           0
3   33    male  22.705         0     no  northwest  21984.47061           0
4   32    male  28.880         0     no  northwest   3866.85520           0


In [57]:
print(df['smoker'].value_counts())

no     1064
yes     274
Name: smoker, dtype: int64


In [58]:
print(df['smoker_val'].value_counts())

0    1064
1     274
Name: smoker_val, dtype: int64
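The counts match (1064 and 274), so every value was mapped. That check matters because .map() does not raise an error for values missing from the dictionary; it silently returns NaN. A minimal sketch of that behavior, using a hypothetical series with one deliberately inconsistent label:

```python
import pandas as pd

# Hypothetical values; "Yes" (capitalized) is deliberately absent from the dictionary.
smoker = pd.Series(["yes", "no", "Yes", "no"])
binary_dict = {"no": 0, "yes": 1}

mapped = smoker.map(binary_dict)

# Values missing from the dictionary become NaN rather than raising an error.
unmapped = smoker[mapped.isna()].unique().tolist()
print(unmapped)
```

If `unmapped` is not empty, the labels probably need cleaning (for example, lowercasing) before mapping.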


# 2. Print the value_counts() of the region column in our dataset

In [59]:
print(df['region'].value_counts())

southeast    364
southwest    325
northwest    325
northeast    324
Name: region, dtype: int64


#### Next we will create region_dict and use it to map the region values to numbers

In [60]:
region_dict = {'southeast':4,'southwest':3,'northwest':2,'northeast':1}
region_dict

{'southeast': 4, 'southwest': 3, 'northwest': 2, 'northeast': 1}

In [61]:
df['region_val'] = df['region'].map(region_dict)
df.head()

   age     sex     bmi  children smoker     region      charges  smoker_val  region_val
0   19  female  27.900         0    yes  southwest  16884.92400           1           3
1   18    male  33.770         1     no  southeast   1725.55230           0           4
2   28    male  33.000         3     no  southeast   4449.46200           0           4
3   33    male  22.705         0     no  northwest  21984.47061           0           2
4   32    male  28.880         0     no  northwest   3866.85520           0           2
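One caveat: the integers 1 to 4 give the regions an order (southeast > southwest, and so on) that the regions themselves do not have, and some models may treat that order as meaningful. As a possible alternative, not the approach used in this project, a one-hot encoding gives each region its own indicator column. A minimal sketch on a hypothetical four-row sample:

```python
import pandas as pd

# Hypothetical sample with one row per region value in the data set.
regions = pd.DataFrame(
    {"region": ["southwest", "southeast", "northwest", "northeast"]}
)

# One indicator column per region; dtype=int gives 0/1 instead of True/False.
region_dummies = pd.get_dummies(regions["region"], prefix="region", dtype=int)
print(region_dummies.columns.tolist())

# Each row has exactly one indicator set.
indicators_per_row = region_dummies.sum(axis=1).tolist()
print(indicators_per_row)
```

The trade-off is width: four columns instead of one, in exchange for not implying a ranking.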


Up to here we have encoded the smoker and region columns, which hold text values. Only the sex column remains, so next we will transform it to numerical values.

# 3. Transforming our sex column to numerical values

In [62]:
print(df['sex'].value_counts())

male      676
female    662
Name: sex, dtype: int64


In [63]:
# Assign 1 to male and 0 to female
# Create sexVal_dict
sexVal_dict = {"male":1,'female':0}
sexVal_dict

{'male': 1, 'female': 0}

In [65]:
df['sex_val'] = df['sex'].map(sexVal_dict)
df.head()

   age     sex     bmi  children smoker     region      charges  smoker_val  region_val  sex_val
0   19  female  27.900         0    yes  southwest  16884.92400           1           3        0
1   18    male  33.770         1     no  southeast   1725.55230           0           4        1
2   28    male  33.000         3     no  southeast   4449.46200           0           4        1
3   33    male  22.705         0     no  northwest  21984.47061           0           2        1
4   32    male  28.880         0     no  northwest   3866.85520           0           2        1


# Use pandas’ get_dummies() method to one-hot encode the smoker_val feature. Assign this to a variable called one_hot.

In [67]:
one_hot = pd.get_dummies(df['smoker_val'])
one_hot

      0  1
0     0  1
1     1  0
2     1  0
3     1  0
4     1  0
...  .. ..
1333  1  0
1334  1  0
1335  1  0
1336  1  0


Join the results from one_hot back to our original data frame. Then print out the column names. What has been added?

In [68]:
df = df.join(one_hot)
df.head()

   age     sex     bmi  children smoker     region      charges  smoker_val  region_val  sex_val  0  1
0   19  female  27.900         0    yes  southwest  16884.92400           1           3        0  0  1
1   18    male  33.770         1     no  southeast   1725.55230           0           4        1  1  0
2   28    male  33.000         3     no  southeast   4449.46200           0           4        1  1  0
3   33    male  22.705         0     no  northwest  21984.47061           0           2        1  1  0
4   32    male  28.880         0     no  northwest   3866.85520           0           2        1  1  0
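The join added two columns labeled with the integers 0 and 1, which say nothing about the feature they came from. One way to get readable names, sketched here on a hypothetical three-row sample rather than the full data frame, is to encode the original text column and pass a prefix:

```python
import pandas as pd

# Hypothetical sample; the real data frame has 1338 rows.
sample = pd.DataFrame({"smoker": ["yes", "no", "no"]})

# prefix="smoker" labels the new columns smoker_no and smoker_yes.
one_hot_named = pd.get_dummies(sample["smoker"], prefix="smoker", dtype=int)
joined = sample.join(one_hot_named)
print(joined.columns.tolist())
```

String column names also avoid mixing integer and string labels in the same data frame.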


# Final Step
The final step in our transformation project is scaling our data. Our features cover a wide range of values so far (age is in the tens, while smoker_val is 0 or 1), so it is best to put everything on the same scale.

Let’s build a data frame that holds only numerical features: the original numeric columns plus the encoded ones we created.

In [69]:
my_df = df[['age','sex_val','bmi','children','smoker_val','region_val']].copy()
my_df.head()

   age  sex_val     bmi  children  smoker_val  region_val
0   19        0  27.900         0           1           3
1   18        1  33.770         1           0           4
2   28        1  33.000         3           0           4
3   33        1  22.705         0           0           2
4   32        1  28.880         0           0           2


Now we can use it to train a machine learning model: all of our text features are encoded in numerical form.

# We are ready to scale our data!
Perform a .fit_transform() on our data set, and print the results to see how the features have changed.

In [70]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()


In [74]:
my_df = scaler.fit_transform(my_df)

In [76]:
my_df

array([[-1.43876426, -1.0105187 , -0.45332   , -0.90861367,  1.97058663,
         0.40287427],
       [-1.50996545,  0.98959079,  0.5096211 , -0.07876719, -0.5074631 ,
         1.28800691],
       [-0.79795355,  0.98959079,  0.38330685,  1.58092576, -0.5074631 ,
         1.28800691],
       ...,
       [-1.50996545, -1.0105187 ,  1.0148781 , -0.90861367, -0.5074631 ,
         1.28800691],
       [-1.29636188, -1.0105187 , -0.79781341, -0.90861367, -0.5074631 ,
         0.40287427],
       [ 1.55168573, -1.0105187 , -0.26138796, -0.90861367,  1.97058663,
        -0.48225837]])
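Two things are worth noting about this output. StandardScaler rescales each column to mean 0 and standard deviation 1 by computing (x - mean) / std, using the population standard deviation. And fit_transform() returns a NumPy array, so the column names are gone. The sketch below, on a hypothetical five-row sample with two of the features, checks the formula by hand and wraps the result back into a data frame:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical sample with two of the numeric features.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
})

scaler = StandardScaler()
scaled = scaler.fit_transform(sample)  # NumPy array, not a DataFrame

# The same result by hand: subtract the column mean, divide by the
# population standard deviation (ddof=0).
manual = (sample - sample.mean()) / sample.std(ddof=0)
matches_manual = np.allclose(scaled, manual.to_numpy())
print(matches_manual)

# Wrapping the array restores the column names lost by fit_transform().
scaled_df = pd.DataFrame(scaled, columns=sample.columns, index=sample.index)
print(scaled_df.head())
```

Keeping the scaled values in a separate variable such as `scaled_df`, instead of overwriting the input, leaves the unscaled features available for later inspection.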

# Congratulations!

You have successfully completed this transformation project. Transformations are an incredibly valuable skill to have. Great job!