
# Feature Engineering Exercise (Core)

* Author: Udayakumar Routhu

In [1]:
# Imports
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.decomposition import PCA
import seaborn as sns

In [2]:
# Mount google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### Import the data, then drop the 'casual' and 'registered' columns. These are redundant with your target, 'count'.

In [26]:
# Path to the CSV on the mounted Google Drive (a local file path, not a URL)
file_path = '/content/drive/MyDrive/CodingDojo/03-AdvancedML/Week10/Data/bikeshare_train - bikeshare_train.csv'
df = pd.read_csv(file_path)

In [27]:
df = df.drop(['casual', 'registered'], axis=1)
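
The redundancy claim can be checked before dropping the columns. The following is a minimal sketch on a hypothetical three-row frame (`toy` and its values are illustrative only), assuming that 'count' is the sum of 'casual' and 'registered', which is what makes the two columns leak the target.

```python
import pandas as pd

# Hypothetical toy rows; the values are illustrative, not read from the real file
toy = pd.DataFrame({
    "casual": [3, 8, 5],
    "registered": [13, 32, 27],
    "count": [16, 40, 32],
})

# If the two columns always add up to the target, they leak it and should be dropped
is_redundant = (toy["casual"] + toy["registered"]).eq(toy["count"]).all()

toy = toy.drop(columns=["casual", "registered"])
```

Running the same comparison on the full data frame before the drop would confirm whether the assumption holds for every row.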

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 10 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   datetime    10886 non-null  object 
 1   season      10886 non-null  int64  
 2   holiday     10886 non-null  int64  
 3   workingday  10886 non-null  int64  
 4   weather     10886 non-null  int64  
 5   temp        10886 non-null  float64
 6   atemp       10886 non-null  float64
 7   humidity    10886 non-null  int64  
 8   windspeed   10886 non-null  float64
 9   count       10886 non-null  int64  
dtypes: float64(3), int64(6), object(1)
memory usage: 850.6+ KB


In [29]:
df.head()

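| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | count |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 0:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 16 |
| 1 | 2011-01-01 1:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 40 |
| 2 | 2011-01-01 2:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 32 |
| 3 | 2011-01-01 3:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 13 |
| 4 | 2011-01-01 4:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 1 |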


### Transform the 'datetime' column into a datetime type and use it to create 3 new columns in the data frame containing the:

In [30]:
df['datetime'] = pd.to_datetime(df['datetime'])

1. Name of the Month

In [31]:
df['month'] = df['datetime'].dt.strftime('%B')

2. Name of the Day of the Week

In [32]:
df['day_of_week'] = df['datetime'].dt.strftime('%A')

3. Hour of the Day

In [33]:
df['hour'] = df['datetime'].dt.hour

3.1 Make sure all 3 new columns are 'object' datatype so they can be one-hot encoded later.

In [34]:
# Convert new columns to object data type
df['month'] = df['month'].astype('object')
df['day_of_week'] = df['day_of_week'].astype('object')
df['hour'] = df['hour'].astype('object')

In [35]:
df.head()

| | datetime | season | holiday | workingday | weather | temp | atemp | humidity | windspeed | count | month | day_of_week | hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2011-01-01 00:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 81 | 0.0 | 16 | January | Saturday | 0 |
| 1 | 2011-01-01 01:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 40 | January | Saturday | 1 |
| 2 | 2011-01-01 02:00:00 | 1 | 0 | 0 | 1 | 9.02 | 13.635 | 80 | 0.0 | 32 | January | Saturday | 2 |
| 3 | 2011-01-01 03:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 13 | January | Saturday | 3 |
| 4 | 2011-01-01 04:00:00 | 1 | 0 | 0 | 1 | 9.84 | 14.395 | 75 | 0.0 | 1 | January | Saturday | 4 |


3.2 Drop the 'datetime' and 'season' columns. These are now redundant.

In [11]:
# Drop 'datetime' and 'season' columns
df = df.drop(['datetime', 'season'], axis=1)
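
The datetime steps above can be sketched end to end on a small frame. This is an illustrative example only: `toy` and `encoded` are hypothetical names, `dt.month_name()` and `dt.day_name()` are swapped in as alternatives to the `strftime` calls, and `pd.get_dummies` is one possible way to do the one-hot encoding mentioned in step 3.1, not necessarily the one used later.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the bikeshare data
toy = pd.DataFrame({"datetime": ["2011-01-01 00:00:00", "2011-01-01 01:00:00"]})
toy["datetime"] = pd.to_datetime(toy["datetime"])

# month_name() and day_name() return English names by default,
# whereas strftime('%B') / strftime('%A') follow the process locale
toy["month"] = toy["datetime"].dt.month_name()
toy["day_of_week"] = toy["datetime"].dt.day_name()

# The hour is numeric by default, so cast it to object to mark it as categorical
toy["hour"] = toy["datetime"].dt.hour.astype("object")

toy = toy.drop(columns=["datetime"])

# By default get_dummies encodes object columns; an integer 'hour' would pass through unchanged
encoded = pd.get_dummies(toy)
```

Casting 'hour' to object matters because the values 0 to 23 are labels for time slots, and leaving them as integers would suggest an ordering or magnitude to a model.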

### The temperatures in the 'temp' and 'atemp' columns are in Celsius. Use `.apply()` and a Lambda function to convert them to Fahrenheit.

In [13]:
# Define a lambda function to convert Celsius to Fahrenheit
celsius_to_fahrenheit = lambda celsius: (celsius * 9/5) + 32

In [15]:
# Apply the lambda function to 'temp' and 'atemp' columns
df['temp'] = df['temp'].apply(celsius_to_fahrenheit)
df['atemp'] = df['atemp'].apply(celsius_to_fahrenheit)
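
As a quick sanity check on the conversion, the sketch below compares the `.apply()` approach with plain vectorized arithmetic on a hypothetical frame (`toy`, `applied`, and `vectorized` are illustrative names). Both should give the same values, with 0 °C mapping to 32 °F and 100 °C to 212 °F.

```python
import numpy as np
import pandas as pd

# Hypothetical Celsius readings, including the freezing and boiling points of water
toy = pd.DataFrame({"temp": [0.0, 9.84, 100.0], "atemp": [0.0, 14.395, 100.0]})

# Element-wise conversion with .apply() and a lambda, as the exercise asks
applied = toy["temp"].apply(lambda c: c * 9 / 5 + 32)

# Vectorized arithmetic converts both columns at once
vectorized = toy[["temp", "atemp"]] * 9 / 5 + 32

# Both approaches should agree up to floating-point tolerance
same = np.allclose(applied, vectorized["temp"])
```

The exercise requires `.apply()` with a lambda; the vectorized form is shown only as a cross-check and is usually faster on large frames.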

### Create a new column, 'temp_variance', which shows how much warmer or colder the current temperature ('temp') is than the average temperature for that day of the year ('atemp'). If the current temperature is warmer than average ('atemp'), the value in 'temp_variance' should be positive.

In [18]:
# The exercise treats 'atemp' as the average temperature for that day of the year
average_temp_by_day = df['atemp']

In [20]:
# Create the 'temp_variance' column: positive when 'temp' is warmer than 'atemp'
df['temp_variance'] = df['temp'] - average_temp_by_day
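
The sign convention in the heading can be illustrated on a hypothetical frame (`toy` and its values are made up for this sketch), taking 'atemp' as the average temperature as the exercise text does: subtracting 'atemp' from 'temp' gives a positive value when the current temperature is warmer than average.

```python
import pandas as pd

# Hypothetical Fahrenheit values; 'atemp' stands in for the average, per the exercise text
toy = pd.DataFrame({"temp": [50.0, 60.0, 70.0], "atemp": [55.0, 60.0, 62.5]})

# Positive when 'temp' is warmer than 'atemp', negative when colder, zero when equal
toy["temp_variance"] = toy["temp"] - toy["atemp"]
```

Once this difference is stored, 'atemp' carries no information beyond 'temp' and 'temp_variance', which is why it can be dropped afterwards.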

1. Drop the 'atemp' column

In [24]:
# Drop the 'atemp' column
df.drop(columns=['atemp'], inplace=True)

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10886 entries, 0 to 10885
Data columns (total 11 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   holiday        10886 non-null  int64  
 1   workingday     10886 non-null  int64  
 2   weather        10886 non-null  int64  
 3   temp           10886 non-null  float64
 4   humidity       10886 non-null  int64  
 5   windspeed      10886 non-null  float64
 6   count          10886 non-null  int64  
 7   month          10886 non-null  object 
 8   day_of_week    10886 non-null  object 
 9   hour           10886 non-null  object 
 10  temp_variance  10886 non-null  float64
dtypes: float64(3), int64(5), object(3)
memory usage: 935.6+ KB
