### Dataset : IOT Sensor data

### Aim of the project
We apply statistical tests to find input variables that are independent of each other and dependent on the target variable.

1. Dependent variable (target variable): Vehicles
2. Independent variables (input variables): DateTime, Junction, ID

In [494]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
from datetime import datetime

In [495]:
df_train = pd.read_csv("IOT_train.csv")
df_train.head(3)

Unnamed: 0,DateTime,Junction,Vehicles,ID
0,2015-11-01 00:00:00,1,15,20151101001
1,2015-11-01 01:00:00,1,13,20151101011
2,2015-11-01 02:00:00,1,10,20151101021


In [496]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48120 entries, 0 to 48119
Data columns (total 4 columns):
DateTime    48120 non-null object
Junction    48120 non-null int64
Vehicles    48120 non-null int64
ID          48120 non-null int64
dtypes: int64(3), object(1)
memory usage: 1.5+ MB


### Convert the DateTime column from type object to type datetime, to extract more information

In [497]:
df_train['DateTime'] = pd.to_datetime(df_train['DateTime'])

In [498]:
df_train['Month'] = df_train.DateTime.dt.month
df_train['Day']= df_train.DateTime.dt.day #Day of month
df_train['Time']= (((df_train.DateTime.dt.hour * 60 )+ df_train.DateTime.dt.minute) * 60 ) + df_train.DateTime.dt.second
df_train['Weekday']= df_train.DateTime.dt.weekday
df_train['Week']= df_train.DateTime.dt.isocalendar().week.astype('int64')  #ISO week of year (Series.dt.week was removed in pandas 2.0)
df_train['Quarter']= df_train.DateTime.dt.quarter #Date belongs to which quarter of the year. Eg: Oct-Nov-Dec belong to quarter4

### Replacing Date with unixtime list

In [499]:
def datetounix(df):
    # Initialising unixtime list
    unixtime = []
    
    # Running a loop for converting Date to seconds
    for date in df['DateTime']:
        unixtime.append(time.mktime(date.timetuple()))
    
    # Replacing Date with unixtime list
    df['DateTime'] = unixtime
    return(df)

In [500]:
df_train = datetounix(df_train)
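
As a possible alternative to the loop in `datetounix`, the conversion can be vectorised. The sketch below is an assumption about an equivalent approach, not the notebook's method: it treats the naive timestamps as UTC, whereas `time.mktime` interprets them in the machine's local time zone, so the two results differ by the local UTC offset. The sample series `ts` is made up for illustration.

```python
import pandas as pd

# Hypothetical sample timestamps in the same format as the DateTime column
ts = pd.Series(pd.to_datetime(["2015-11-01 00:00:00", "2015-11-01 01:00:00"]))

# Whole seconds elapsed since the Unix epoch, treating the naive timestamps as UTC
unix_seconds = (ts - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
print(unix_seconds.tolist())
```

Consecutive hourly rows differ by 3600 seconds, which matches the `Time` column step seen in the notebook output.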

In [501]:
df_train.head(2)

Unnamed: 0,DateTime,Junction,Vehicles,ID,Month,Day,Time,Weekday,Week,Quarter
0,1446316000.0,1,15,20151101001,11,1,0,6,44,4
1,1446320000.0,1,13,20151101011,11,1,3600,6,44,4


In [502]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48120 entries, 0 to 48119
Data columns (total 10 columns):
DateTime    48120 non-null float64
Junction    48120 non-null int64
Vehicles    48120 non-null int64
ID          48120 non-null int64
Month       48120 non-null int64
Day         48120 non-null int64
Time        48120 non-null int64
Weekday     48120 non-null int64
Week        48120 non-null int64
Quarter     48120 non-null int64
dtypes: float64(1), int64(9)
memory usage: 3.7 MB


### Convert features with a finite set of possible values to categorical data
#### Advantage: the categorical dtype uses less memory, which can improve performance (compare the memory usage reported by `df_train.info()` before and after the conversion)

Ref: https://pbpython.com/pandas_dtypes_cat.html
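
The memory saving can be checked directly. This is a minimal sketch on a made-up integer series (`s_int` is hypothetical, not a column of `df_train`); it compares the bytes used by the same values stored as `int64` and as `category`.

```python
import numpy as np
import pandas as pd

# Hypothetical column with only four distinct values, repeated many times
s_int = pd.Series(np.tile([1, 2, 3, 4], 10_000), dtype="int64")
s_cat = s_int.astype("category")

# Bytes used by the values themselves (index excluded)
int_bytes = s_int.memory_usage(index=False, deep=True)
cat_bytes = s_cat.memory_usage(index=False, deep=True)
print(int_bytes, cat_bytes)
```

The saving comes from storing small integer codes plus one copy of each category, so it shrinks as the number of distinct values grows.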

In [503]:
# Make Junction, Weekday, Quarter columns as categorical
df_train['Junction'] = df_train['Junction'].astype('category')
df_train['Weekday'] = df_train['Weekday'].astype('category')
df_train['Quarter'] = df_train['Quarter'].astype('category')

df_train.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48120 entries, 0 to 48119
Data columns (total 10 columns):
DateTime    48120 non-null float64
Junction    48120 non-null category
Vehicles    48120 non-null int64
ID          48120 non-null int64
Month       48120 non-null int64
Day         48120 non-null int64
Time        48120 non-null int64
Weekday     48120 non-null category
Week        48120 non-null int64
Quarter     48120 non-null category
dtypes: category(3), float64(1), int64(6)
memory usage: 2.7 MB


### Get list of continuous and categorical data 

In [504]:
continuous_Vars = df_train.select_dtypes(exclude=['object','category']).columns
categorical_Vars = df_train.select_dtypes(include=['object','category']).columns

In [505]:
"""
Here we remove the ID variable from the continuous_Vars list
"""
def dropVar(var,continuous_Vars):
    # Drop only an exact column-name match; a substring test would also drop
    # any column whose name merely contains var
    if var in continuous_Vars:
        continuous_Vars = continuous_Vars.drop(var)
    return continuous_Vars


continuous_Vars = dropVar("ID", continuous_Vars)

In [506]:

continuous_Vars

Index(['DateTime', 'Vehicles', 'Month', 'Day', 'Time', 'Week'], dtype='object')

# Univariate Analysis  :: Continuous variables 

### Aim : To identify any outliers / skewness
##### Techniques that can be used:
1. Central tendency measures (mean, median, mode)
2. Box plot
3. Distribution/histogram plot

#### Inference: Comparing mean and median indicates whether the data is skewed to the left or right, i.e. at which end the outliers lie
1. Data skewed to left: more outliers at the lower end of the data :: Mean < Median/Mode
2. Data skewed to right: more outliers at the higher end of the data :: Mean > Median/Mode
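
The mean-versus-median rule can be illustrated on a small made-up sample. This is a sketch under the assumption that a single large value stands in for the outliers; `Series.skew()` is shown alongside as a second indicator and is not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical sample with one large outlier at the upper end
values = pd.Series([1, 2, 2, 3, 3, 3, 4, 50])

mean_value = values.mean()      # pulled upwards by the outlier
median_value = values.median()  # robust to the outlier
skewness = values.skew()        # positive for right-skewed data

print(mean_value, median_value, skewness)
```

Here the mean exceeds the median and the skewness is positive, so the sample is skewed to the right.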

In [507]:
"""
Note: A column whose mean equals its median (Time, in this dataset) prints nothing
"""
def univar_analysis_continuous(continuous_Vars):
    for col in continuous_Vars:
        if(df_train[col].mean() > df_train[col].median()):
            print(col , " is skewed to right i.e. More outliers at higher values")
        elif (df_train[col].mean() < df_train[col].median()):
            print(col, " column is skewed to left i.e. More outliers at lower values")
        #plt.boxplot(df_train[col])
        #plt.hist(df_train[col])
            
univar_analysis_continuous(continuous_Vars[:])


DateTime  column is skewed to left i.e. More outliers at lower values
Vehicles  is skewed to right i.e. More outliers at higher values
Month  is skewed to right i.e. More outliers at higher values
Day  column is skewed to left i.e. More outliers at lower values
Week  is skewed to right i.e. More outliers at higher values


### TODO: This is population data. We need to draw samples (sample size > 30) and check whether the sample distribution is normal. Hypothesis tests (z-test or t-test) can then be performed on these samples.
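
One way this TODO could be approached is sketched below. Everything here is an assumption for illustration: the population is simulated rather than taken from `df_train`, the sample size of 50 is arbitrary, and the Shapiro-Wilk test (`stats.shapiro`) is just one possible normality check.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical population of counts standing in for a column such as Vehicles
population = rng.poisson(lam=20, size=10_000)

# Draw one sample with more than 30 observations
sample = rng.choice(population, size=50, replace=False)

# Normality check on the sample (H0: the sample comes from a normal distribution)
shapiro_stat, shapiro_p = stats.shapiro(sample)

# One-sample t-test (H0: the sample mean equals the population mean)
t_stat, t_p = stats.ttest_1samp(sample, popmean=population.mean())

print(shapiro_p, t_p)
```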

# Univariate Analysis  :: Categorical variables 

### Aim: To identify the count / frequency distribution
#### Techniques that can be used:
1. Frequency table (crosstab): can be used for continuous and categorical data types
2. Bar plot
3. Count plot (seaborn)
4. value_counts()

Ref : https://www.kaggle.com/hamelg/python-for-data-19-frequency-tables
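
As a quick illustration of the last technique, `value_counts()` gives both absolute counts and percentage shares. The series `s` is a made-up example, not a column of `df_train`.

```python
import pandas as pd

# Hypothetical categorical column
s = pd.Series(["a", "b", "a", "c", "a"])

counts = s.value_counts()                      # absolute frequency per category
shares = s.value_counts(normalize=True) * 100  # percentage share per category
most_common = counts.idxmax()                  # label of the most frequent category

print(counts)
print(shares)
```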

In [508]:
def univar_analysis_categorical(obj_col_list):
    for i in obj_col_list:
        freq = pd.crosstab(index=df_train[i], columns='count')
        percent_val = (freq / freq.sum()) * 100
        # idxmax returns the label of the category with the highest share (Series.argmax is deprecated for this use)
        print("Around ", np.floor(percent_val['count'].max()) ,"% data in ",i.upper()," column belongs to the category : ",percent_val['count'].idxmax())
        # plt.bar(freq.index,freq['count'])
        # sns.countplot(df_train[i])
    
univar_analysis_categorical(categorical_Vars)

Around  30.0 % data in  JUNCTION  column belongs to the category :  1
Around  14.0 % data in  WEEKDAY  column belongs to the category :  0
Around  31.0 % data in  QUARTER  column belongs to the category :  2




# Bi-variate analysis : Test for dependency

### Get variable pairs 
1. Continuous-Continuous Variable pairs
2. Continuous-Categorical Variable pairs
3. Categorical-Categorical Variable pairs

In [509]:
from scipy import stats
def get_ContinuousVar_Pairs(continuous_Var):
    continuousVar_pairs  = []
    for i in range(len(continuous_Var)):
        for j  in range(i+1,len(continuous_Var)):
            continuousVar_pairs.append((continuous_Var[i], continuous_Var[j]))
    return continuousVar_pairs
            
def get_CategoricalVar_Pairs(categorical_Var):
    categoricalVar_pairs  = []
    for i in range(len(categorical_Var)):
        for j  in range(i+1,len(categorical_Var)):
            categoricalVar_pairs.append((categorical_Var[i], categorical_Var[j]))
    return categoricalVar_pairs

            
def get_Categ_Conti_Pairs(categorical_Var,continuous_Var):
    categ_ContiVar_pairs  = []
    for i in range(len(continuous_Var)):
        for j  in range(len(categorical_Var)):
            categ_ContiVar_pairs.append((continuous_Var[i], categorical_Var[j]))
    return categ_ContiVar_pairs

continuousVar_pairs = get_ContinuousVar_Pairs(continuous_Vars)
categoricalVar_pairs = get_CategoricalVar_Pairs(categorical_Vars)
categ_ContVar_pairs = get_Categ_Conti_Pairs(categorical_Vars,continuous_Vars)
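
The same three kinds of pairs can also be produced with `itertools`. This is a sketch of an equivalent approach on shortened, illustrative variable lists; it is not what the notebook runs.

```python
from itertools import combinations, product

# Shortened, illustrative variable lists
continuous = ["DateTime", "Vehicles", "Month"]
categorical = ["Junction", "Weekday"]

# Unordered pairs within one list (each pair appears once)
continuous_pairs = list(combinations(continuous, 2))
categorical_pairs = list(combinations(categorical, 2))

# Every (continuous, categorical) combination
mixed_pairs = list(product(continuous, categorical))

print(continuous_pairs)
print(mixed_pairs)
```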

In [510]:
#Creating list to store dependent and independent variables
dependent_var = []
independent_var = []

# Dependency test expected results

1. Input variables should be independent of each other.
    > If two input variables are dependent, we need to keep only one of them.
2. Input variables should be dependent on the output variable.
    > If an input variable is independent of the output variable, we should drop it while modelling.

# Dependency test on Continuous-Categorical Pair

1. Dependent variables : Continuous
2. Independent variables : categorical


#### Example: Suppose we want to know whether training hours (dependent variable) differ between male and female learners who have taken the same online course. This tells us whether Gender and TrainingHour are dependent. The null hypothesis assumes no dependency between the variables (no difference in mean training hours between male and female). The alternate hypothesis assumes there is a dependency.

1. H0 : Assume no difference in the training hours between Male(M) and Female(F) , Mean_training_F = Mean_training_M
2. H1 : There is difference. Mean_training_F ≠ Mean_training_M
    
## Anova : Analysis of Variance test
 
#### Steps done internally by stats.f_oneway

1. Calculate within group and between groups variability
2. Calculate F-Ratio
3. Calculate probability using F-table
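
The three steps can be reproduced by hand and compared with `stats.f_oneway`. This is a sketch on three made-up groups; the variable names are illustrative, and the p-value is taken from the F distribution's survival function rather than a printed F-table.

```python
import numpy as np
from scipy import stats

# Hypothetical measurements for three categories
groups = [
    np.array([10.0, 12.0, 11.0, 13.0]),
    np.array([14.0, 15.0, 16.0, 15.0]),
    np.array([9.0, 8.0, 10.0, 9.0]),
]
all_values = np.concatenate(groups)
k = len(groups)            # number of categories
n_total = all_values.size  # total number of observations

# Step 1: between-group and within-group variability (sums of squares)
grand_mean = all_values.mean()
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Step 2: F-ratio = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n_total - k))

# Step 3: right-tail probability of the F distribution
p_manual = stats.f.sf(f_manual, k - 1, n_total - k)

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy, p_manual, p_scipy)
```

Note the degrees of freedom used here: `k - 1` for the numerator and `n_total - k` for the denominator.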

### What do we infer ?
If the category means are far apart relative to the variation within the categories, the F-ratio is large and falls in the rejection area (p_value below critical_p_value = alpha). We then reject the null hypothesis in favour of the alternate hypothesis and say that the difference between the category means is significant. With this we conclude that the variables are dependent on each other.

# Hypothesis
### H0: Variables are independent of each other. 
### H1: Variables are dependent on each other

In [511]:
"""
We apply the one-way ANOVA test (stats.f_oneway) on the grouped categories. It returns a tuple (f_ratio, p_value).
We set critical_p_value (alpha) to 0.05 - one-tailed (right) test.

We need to identify 2 degrees of freedom (DF)::
> Numerator DF = number of categories - 1
> Denominator DF = total number of observations - number of categories
We calculate the corresponding critical_f_ratio using stats.f.ppf, passing alpha and the DFs.

Here we test the hypothesis by 2 methods (either one is sufficient):
1. critical_p_value (alpha) vs p_value : if p_value is at most alpha, we reject the null hypothesis
2. critical_f_ratio vs f_ratio : if f_ratio is greater than critical_f_ratio, i.e. it falls within the rejection area,
we reject the null hypothesis

When we reject the null hypothesis, we favour the alternate hypothesis, conclude that the variables are dependent and
add the pair to the dependent_var list.
Otherwise we add the pair to the independent_var list.
"""
def anovaTest(categories,contCateg,categColumnsDf):
    # categColumnsDf is kept for compatibility with the caller; the DFs are derived from the groups themselves
    result = stats.f_oneway(*categories)
    dfn = len(categories) - 1                                 # number of categories - 1
    dfd = sum(len(c) for c in categories) - len(categories)   # total observations - number of categories
    f_ratio = result[0]
    #Calculate p_value from f_ratio : https://www.socscistatistics.com/pvalues/fdistribution.aspx
    p_value = result[1]
    # Calculate f-ratio from p-value(alpha) : http://www.biokin.com/tools/f-critical.html
    critical_p_value = alpha = 0.05 #Rejection area for one-tailed(right) test
    critical_f_ratio = stats.f.ppf(q=1-alpha, dfn=dfn, dfd=dfd)

    if(f_ratio > critical_f_ratio):
        f_result = "Reject null Hyp"
    else:
        f_result = "Fail to reject null Hyp"


    if(p_value <= critical_p_value):
        p_val_result = "Reject null Hyp"
    else:
        p_val_result = "Fail to reject null Hyp"


    if(f_result ==  p_val_result ==  "Fail to reject null Hyp"):
        print(contCateg[0], "and", contCateg[1], ":: INDEPENDENT")
        independent_var.append((contCateg[0],contCateg[1]))
    elif(f_result ==  p_val_result ==  "Reject null Hyp"):
        print("****",contCateg[0], "and", contCateg[1], ":: DEPENDENT *****")
        dependent_var.append((contCateg[0],contCateg[1]))
        #print("Average training hours of male and female are statistically different. Hence, training hour depends on gender")
    else:
        print("Error :: Please check the results")

In [512]:
"""
We take the continuous-categorical pair and group them based on categories.
Apply ANOVA test on the grouped data
"""
def analyse_Conti_Categ(categ_ContVar_pairs):
    for contCateg in categ_ContVar_pairs:
        categVar = df_train[contCateg[1]]
        contVar = df_train[contCateg[0]]
        categ_cont_Df = pd.DataFrame({contCateg[1] : categVar, contCateg[0]: contVar })
        grouped = categ_cont_Df.groupby(contCateg[1]).groups
        newDf = pd.DataFrame.from_dict(grouped, orient='index')
        categColumnsDf = newDf.transpose()
        allCategoriesData =[]
        for i in grouped:
            eachCategoryData= contVar[grouped[i]]
            allCategoriesData.append(eachCategoryData)
        anovaTest(allCategoriesData,contCateg,categColumnsDf)
       
        
analyse_Conti_Categ(categ_ContVar_pairs)

**** DateTime and Junction :: DEPENDENT *****
DateTime and Weekday :: INDEPENDENT
**** DateTime and Quarter :: DEPENDENT *****
**** Vehicles and Junction :: DEPENDENT *****
**** Vehicles and Weekday :: DEPENDENT *****
**** Vehicles and Quarter :: DEPENDENT *****
**** Month and Junction :: DEPENDENT *****
Month and Weekday :: INDEPENDENT
**** Month and Quarter :: DEPENDENT *****
Day and Junction :: INDEPENDENT
Day and Weekday :: INDEPENDENT
Day and Quarter :: INDEPENDENT
Time and Junction :: INDEPENDENT
Time and Weekday :: INDEPENDENT
Time and Quarter :: INDEPENDENT
**** Week and Junction :: DEPENDENT *****
Week and Weekday :: INDEPENDENT
**** Week and Quarter :: DEPENDENT *****


# Dependency test on Categorical-Categorical Pair

Ref: https://courses.lumenlearning.com/wmopen-concepts-statistics/chapter/test-of-independence-2-of-3/

We consider Frequency distribution of data between categorical variables using crosstab function
The frequency of each category for one nominal variable is compared across the categories of the second nominal variable.

If conditional percentages for one category differ from other, this says there is dependency. But do they differ enough to conclude strongly that variables are dependent? For this we go ahead with Hypothesis testing

## Chi-Square Test
Chi-Square is called a test of independence because “no relationship” means “independent.” If there is a relationship between the two variables in the population, then they are dependent.

# Hypothesis
### H0: Variables are  independent of each other i.e no relationship between categories.
### H1: Variables are dependent on each other i.e there exists relationship between categories
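
For comparison, SciPy offers `stats.chi2_contingency`, which computes the statistic, p-value, degrees of freedom and expected counts in one call. The sketch below uses a made-up 2×2 table and checks the result against the manual formula (expected = row total × column total / grand total). `correction=False` is passed so that no Yates continuity correction is applied and the two calculations agree.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical observed counts for two categorical variables
observed = pd.DataFrame([[30, 10], [20, 40]], index=["A", "B"], columns=["X", "Y"])
obs = observed.to_numpy()

chi2_value, p_value, dof, expected = stats.chi2_contingency(obs, correction=False)

# Manual calculation of expected counts and the chi-square statistic
row_totals = obs.sum(axis=1)
col_totals = obs.sum(axis=0)
expected_manual = np.outer(row_totals, col_totals) / obs.sum()
chi2_manual = ((obs - expected_manual) ** 2 / expected_manual).sum()

print(chi2_value, p_value, dof)
```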

In [513]:
"""
Here we obtain the observed counts from the distribution table and compute the expected counts under independence
(row total * column total / grand total).
If the difference between observed and expected counts is large, the chi-square value is large.
If the chi-square value falls in the rejection region, we reject the null hypothesis and conclude that the variables are dependent.

Degrees of freedom = (number of categories in categoricalVar1 - 1) × (number of categories in categoricalVar2 - 1)

We test the hypothesis by 2 methods (either one is sufficient):
1. critical_p_value (alpha) vs p_value : if p_value is at most alpha, we reject the null hypothesis
2. critical_Chisq_value vs chisq_value : if chisq_value is greater than critical_Chisq_value, i.e. it falls within the rejection area,
we reject the null hypothesis

When we reject the null hypothesis, we favour the alternate hypothesis, conclude that the variables are dependent and
add the pair to the dependent_var list.
Otherwise we add the pair to the independent_var list.
"""
from scipy import stats
def chisq(pair,var0,var1):
    var0_count = len(pair.index)-1
    var1_count = len(pair.columns)-1
    observedDf = pair.iloc[:var0_count,:var1_count]
    expected = np.outer(pair['Total'][:var0_count], pair.loc['Total'][:var1_count])/pair.iloc[var0_count,var1_count]
    expectedDf = pd.DataFrame(expected, index=observedDf.index, columns= observedDf.columns)
    chisq_value = (np.square(observedDf - expectedDf)/expectedDf).sum().sum()
    df = (len(df_train[var0].value_counts().index)-1) *(len(df_train[var1].value_counts().index)-1)
    alpha = critical_p_value =  0.05
    critical_Chisq_value = stats.chi2.ppf(q = 0.95,df = df) 
   
    p_value = 1 - stats.chi2.cdf(x=chisq_value,  # Find the p-value
                             df=df)
    
    if(chisq_value > critical_Chisq_value):
        z_result = "Reject null Hyp"
    else:
        z_result = "Fail to reject null Hyp"
   
    if(p_value <= critical_p_value):
        p_val_result = "Reject null Hyp"
    else:
        p_val_result = "Fail to reject null Hyp"
    
    if(z_result ==  p_val_result ==  "Fail to reject null Hyp"):
        print(var0, "and", var1, " :: INDEPENDENT")
        independent_var.append((var0,var1))
    elif(z_result ==  p_val_result ==  "Reject null Hyp"):
        print("****",var0, "and", var1, " :: DEPENDENT")
        dependent_var.append((var0,var1))
    else:
        print("Error :: Please check the results")


In [514]:

def analyseCateg(categoricalVar_pairs):
    for categVar in categoricalVar_pairs:
        pair = pd.crosstab(index=df_train[categVar[0]], columns=df_train[categVar[1]], margins=True,margins_name='Total')
        chisq(pair,categVar[0],categVar[1])        
        
analyseCateg(categoricalVar_pairs)


Junction and Weekday  :: INDEPENDENT
**** Junction and Quarter  :: DEPENDENT
Weekday and Quarter  :: INDEPENDENT


# Dependency test on Continuous-Continuous Pair

We calculate correlation between continuous variables

## PearsonR 
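Calculates the Pearson correlation coefficient (r) and the p-value for testing non-correlation.

The closer |r| is to 1, the stronger the linear relationship. r-square is the share of variance in one variable explained by a linear fit on the other.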

# Hypothesis
### H0: Variables are independent of each other, i.e. the slope is zero.
### H1: Variables are dependent on each other.
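
A small illustration of `pearsonr` on made-up data follows. The arrays `x` and `y` are hypothetical, and `np.corrcoef` is included only as a cross-check of the coefficient.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical pair of continuous variables with a roughly linear relationship
x = np.arange(10, dtype=float)
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0, 10.0, 9.0])

r_value, p_value = pearsonr(x, y)

# Cross-check the coefficient against NumPy's correlation matrix
r_check = np.corrcoef(x, y)[0, 1]

print(r_value, p_value)
```

Note that Pearson's test only detects linear association, so a non-significant result does not rule out a non-linear dependency.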

In [515]:
"""
Here we consider two continuous variables and find the correlation value and pval
If pval < alpha, we reject null hypothesis and conclude the variables are dependent on each other
"""
from scipy.stats import pearsonr
from scipy.stats import linregress
def test_biVariate_continuousVar_DependencyTest(continuousVar_pairs):
    for contVar in continuousVar_pairs: 
        if(contVar[0]!='ID'):
            rval , pval = pearsonr(df_train[contVar[0]],df_train[contVar[1]])
            if(pval < 0.05):
                dependent_var.append((contVar[0],contVar[1]))
                print("****",contVar[0]," and " , contVar[1], ":: DEPENDENT","****")
            else:
                independent_var.append((contVar[0],contVar[1]))
                print(contVar[0]," and " , contVar[1], ":: INDEPENDENT")
            
test_biVariate_continuousVar_DependencyTest(continuousVar_pairs)

**** DateTime  and  Vehicles :: DEPENDENT ****
**** DateTime  and  Month :: DEPENDENT ****
**** DateTime  and  Day :: DEPENDENT ****
DateTime  and  Time :: INDEPENDENT
**** DateTime  and  Week :: DEPENDENT ****
**** Vehicles  and  Month :: DEPENDENT ****
**** Vehicles  and  Day :: DEPENDENT ****
**** Vehicles  and  Time :: DEPENDENT ****
**** Vehicles  and  Week :: DEPENDENT ****
**** Month  and  Day :: DEPENDENT ****
Month  and  Time :: INDEPENDENT
**** Month  and  Week :: DEPENDENT ****
Day  and  Time :: INDEPENDENT
**** Day  and  Week :: DEPENDENT ****
Time  and  Week :: INDEPENDENT


In [516]:
"""
Get all features from dataset. Remove ID column
"""
features = df_train.columns.values
features = features[features != 'ID']

In [517]:
features

array(['DateTime', 'Junction', 'Vehicles', 'Month', 'Day', 'Time',
       'Weekday', 'Week', 'Quarter'], dtype=object)

In [518]:
"""
Aim is to remove input variables that are independent of the target variable (Vehicles) from the features list
"""
def removeFeatures_IndependentToTarget(target,remove_features,independent_var,features):
    for var in independent_var:
        if(target in var[0]):
            remove_features.append(var[1])
            #print(var[1]," is independent of target. Hence, this var can be removed from features list")
        elif(target in var[1]):
            remove_features.append(var[0])
            #print(var[0]," is independent of target. Hence, this var can be removed from features list")

    for i in remove_features:
        features = np.delete(features, np.where(features == i))

    return features

In [519]:
target = 'Vehicles'
remove_features = ['Vehicles']
features = removeFeatures_IndependentToTarget(target,remove_features,independent_var,features)

In [520]:
print("Original set of input features considered : ",features)

Original set of input features considered :  ['DateTime' 'Junction' 'Month' 'Day' 'Time' 'Weekday' 'Week' 'Quarter']


In [521]:
# Consider each item from  feature list, add all features that are independent to this item
independentFeatures= []
for f in features:
    tempFeatures = []
    tempFeatures.append(f)
    for d in independent_var:
        if(f == d[0]): 
            tempFeatures.append(d[1])
        elif(f == d[1]):
            tempFeatures.append(d[0])
    independentFeatures.append(tempFeatures)       
            
print("Feature list with combinations of independent input variables")
independentFeatures

Feature list with combinations of independent input variables


[['DateTime', 'Weekday', 'Time'],
 ['Junction', 'Day', 'Time', 'Weekday'],
 ['Month', 'Weekday', 'Time'],
 ['Day', 'Junction', 'Weekday', 'Quarter', 'Time'],
 ['Time',
  'Junction',
  'Weekday',
  'Quarter',
  'DateTime',
  'Month',
  'Day',
  'Week'],
 ['Weekday',
  'DateTime',
  'Month',
  'Day',
  'Time',
  'Week',
  'Junction',
  'Quarter'],
 ['Week', 'Weekday', 'Time'],
 ['Quarter', 'Day', 'Time', 'Weekday']]

In [522]:
# Remove any group that contains every input feature, i.e. is the same as the input feature list

def removeInputFeatures(independentFeatures):
    # Compare as sets: testing a list against a NumPy array with `in` compares
    # element-wise and only works by accident
    return [group for group in independentFeatures if set(group) != set(features)]

independentFeatures = removeInputFeatures(independentFeatures)
    
print("After removing those features that are same as original set of input features")
independentFeatures

After removing those features that are same as original set of input features




[['DateTime', 'Weekday', 'Time'],
 ['Junction', 'Day', 'Time', 'Weekday'],
 ['Month', 'Weekday', 'Time'],
 ['Day', 'Junction', 'Weekday', 'Quarter', 'Time'],
 ['Week', 'Weekday', 'Time'],
 ['Quarter', 'Day', 'Time', 'Weekday']]

In [527]:
"""
featuresForModelling : These features can be considered while creating models.
These are combinations of input features that are independent of each other and dependent on target variable
"""
featuresForModelling = independentFeatures
print("Combinations of input features that are independent of each other and dependent on target variable, which can be considered while modelling")
featuresForModelling

Combinations of input features that are independent of each other and dependent on target variable, which can be considered while modelling


[['DateTime', 'Weekday', 'Time'],
 ['Junction', 'Day', 'Time', 'Weekday'],
 ['Month', 'Weekday', 'Time'],
 ['Day', 'Junction', 'Weekday', 'Quarter', 'Time'],
 ['Week', 'Weekday', 'Time'],
 ['Quarter', 'Day', 'Time', 'Weekday']]
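
Each group above is built from pairwise results around its first feature, so two other members of a group may still be dependent on each other. For example, Junction and Quarter were reported as DEPENDENT by the chi-square test, yet both appear in the Day group. The sketch below shows one possible check, assuming the tests' results are available as a list of independent pairs like `independent_var`. The helper name `is_mutually_independent` is hypothetical, and the sample pairs are copied from the outputs above.

```python
from itertools import combinations

def is_mutually_independent(group, independent_pairs):
    """Return True only if every pair of features in group was found independent."""
    known = {frozenset(pair) for pair in independent_pairs}
    return all(frozenset(pair) in known for pair in combinations(group, 2))

# A subset of the pairs reported as INDEPENDENT above
independent_pairs = [
    ("Day", "Junction"),
    ("Day", "Weekday"),
    ("Day", "Quarter"),
    ("Junction", "Weekday"),
    ("Weekday", "Quarter"),
]

ok_group = is_mutually_independent(["Day", "Junction", "Weekday"], independent_pairs)
bad_group = is_mutually_independent(["Day", "Junction", "Weekday", "Quarter"], independent_pairs)
print(ok_group, bad_group)
```

The second group fails because the pair (Junction, Quarter) is not among the independent pairs.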