# Fundamentals of Machine Learning

## Variables

1. **Continuous or quantitative:** These variables can take any numerical value, positive or negative, within a wide range. Retail sales amounts and insurance claim amounts are examples. Variables of this type are also generally known as numerical variables.

2. **Discrete or qualitative:** These variables can take only particular values. Retail store location attributes such as area, state, and city are examples, because each store (the store is our object here) has exactly one value for each of them. Variables of this type are also known as categorical variables.
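
As a rough illustration of the two variable types, the sketch below splits the columns of a small table into numerical and categorical ones. The store records and column names are made up for this example, and `select_dtypes` is only one of several ways to make the split.

```python
import pandas as pd

# Hypothetical store records: one row per store (the "object").
stores = pd.DataFrame({
    "city": ["Austin", "Boston", "Chicago"],
    "state": ["TX", "MA", "IL"],
    "sales_amount": [125000.50, 98000.00, 143250.75],
})

# Columns with a numeric dtype are treated as numerical variables.
numeric_cols = stores.select_dtypes(include="number").columns.tolist()
# Everything else (here: text columns) is treated as categorical.
categorical_cols = stores.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)      # ['sales_amount']
print(categorical_cols)  # ['city', 'state']
```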

# Scales of Measurement
In general, variables can be measured on **four different scales**: nominal, ordinal, interval, and ratio. The scale determines which summary statistics are meaningful for a variable.
1. **Mean, median, and mode** describe the central tendency, that is, the middle point of the data distribution.
2. **Standard deviation, variance, and range** are the most commonly used dispersion measures, which describe the spread of the data.
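
The measures listed above can be computed with Python's standard `statistics` module. This is a minimal sketch on made-up claim amounts; it uses the population versions of variance and standard deviation, which is an assumption, since the text does not say whether population or sample statistics are meant.

```python
import statistics

# Hypothetical insurance claim amounts.
claims = [120, 150, 150, 180, 400]

# Central tendency
mean = statistics.mean(claims)      # 200
median = statistics.median(claims)  # 150
mode = statistics.mode(claims)      # 150

# Dispersion (population versions; use variance()/stdev() for a sample)
variance = statistics.pvariance(claims)   # 10360
std_dev = statistics.pstdev(claims)       # square root of the variance
value_range = max(claims) - min(claims)   # 280

print(mean, median, mode)
print(variance, round(std_dev, 2), value_range)
```

Note how the single large claim pulls the mean (200) well above the median (150).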

# Nominal Scale of Measurement
Data are measured at the nominal level when each case is classified into one of a number of discrete categories. This is also called **categorical**, that is, used only for classification. Because a mean is not meaningful for such data, all we can do is **count the number of occurrences of each type and compute the proportion (number of occurrences of each type / total occurrences).**
<br>
**Nominal scale examples**<br>
**Color:** Red, Green, Yellow, etc.<br>
**Gender:** Female, Male<br>
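
Counting occurrences and computing proportions, as described above, might look like the following with pandas. The color observations are invented for illustration.

```python
import pandas as pd

# Hypothetical nominal observations.
colors = pd.Series(["Red", "Green", "Red", "Yellow", "Red", "Green"])

# Number of occurrences of each category.
counts = colors.value_counts()
# Proportion = occurrences of each category / total occurrences.
proportions = colors.value_counts(normalize=True)

print(counts)
print(proportions)
```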


# Ordinal Scale of Measurement
Data are measured on an ordinal scale if the categories imply order. The difference between ranks is consistent in direction and authority, but not magnitude. <br>
**Ordinal scale examples**<br>
<br>**Military rank:** Second Lieutenant, First Lieutenant, Captain, Major, Lieutenant Colonel, Colonel, etc.
<br>**Clothing size:** Small, Medium, Large, Extra Large, etc.
<br>**Class rank in an exam:** 1,2,3,4,5, etc.
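
One way to keep the order of an ordinal variable in code is an ordered pandas `Categorical`. The sketch below uses the clothing sizes from the example above; the observed values are made up. Comparisons and sorting follow the declared order, while the integer codes are only ranks, so their differences carry no magnitude.

```python
import pandas as pd

# Category order, from smallest to largest.
size_order = ["Small", "Medium", "Large", "Extra Large"]

# Hypothetical observations stored as an ordered categorical.
sizes = pd.Series(
    pd.Categorical(
        ["Medium", "Small", "Extra Large", "Large", "Small"],
        categories=size_order,
        ordered=True,
    )
)

print(sizes.min(), sizes.max())      # Small Extra Large
print((sizes > "Small").tolist())    # order-aware comparison
print(sizes.sort_values().tolist())  # sorted by rank, not alphabetically
print(sizes.cat.codes.tolist())      # rank codes: [1, 0, 3, 2, 0]
```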

# Interval Scale of Measurement
If the differences between values are meaningful but the scale has no true zero point, the data are measured on an interval scale. 
<br>**Temperature:** 10, 20, 30, 40, etc.
<br>**IQ rating:** 85 - 114, 115 - 129, 130 - 144, 145 - 159, etc.
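
On an interval scale, differences are meaningful but ratios are not, because the zero point is arbitrary. The sketch below illustrates this with temperature, assuming the values above are in degrees Celsius (the unit is not stated in the text): a 10-degree difference is the same in Celsius and Kelvin, but "twice as hot" holds only in the arbitrary Celsius numbering.

```python
def celsius_to_kelvin(c):
    """Shift a Celsius reading to the Kelvin scale, which has a true zero."""
    return c + 273.15

t1_c, t2_c = 10.0, 20.0

# Differences survive the change of zero point.
diff_c = t2_c - t1_c
diff_k = celsius_to_kelvin(t2_c) - celsius_to_kelvin(t1_c)

# Ratios do not: 20 °C is not "twice as hot" as 10 °C.
ratio_c = t2_c / t1_c
ratio_k = celsius_to_kelvin(t2_c) / celsius_to_kelvin(t1_c)

print(round(diff_c, 2), round(diff_k, 2))    # 10.0 10.0
print(round(ratio_c, 3), round(ratio_k, 3))  # 2.0 1.035
```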

# Ratio Scale of Measurement
Data measured on a ratio scale have differences that are meaningful and relate to a true zero point, so ratios between values are meaningful as well. This is the most common scale of measurement. <br>
**Weight:** 10, 20, 30, 40, 50, 60, etc.<br>
**Height:** 5, 6, 7, 8, 9, etc.<br>
**Age:** 1, 2, 3, 4, 5, 6, 7, etc.<br>

# Dealing with Missing Data
Missing data can mislead an analysis or cause it to fail. To avoid such issues, you need to handle missing values, either by removing them or by imputing replacements. Four techniques are most commonly used.<br>
- **Delete:** You could simply delete the rows containing missing values. This technique is more suitable and effective when the number of rows with missing values is insignificant **(say < 5%)** compared to the overall record count. You can achieve this using pandas' `dropna()` function.<br>
- **Replace** with a summary value such as the mean, median, or mode of the respective column. For continuous or quantitative variables, the mean, median, or mode of the column can be used to replace the missing values. For categorical or qualitative variables, the mode (most frequent value) works better. You can achieve this using pandas' `fillna()` function.<br>
- **Random replace:** You can also replace the missing values with a randomly picked value from the respective column. This technique is appropriate where the number of rows with missing values is insignificant.<br>
- **Using a predictive model:** This is an advanced technique. Here you train a regression model for continuous variables, or a classification model for categorical variables, on the available data and use the model to predict the missing values. 
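
The first three techniques can be sketched with pandas as follows. The data frame, its column names, and the random seed are invented for this example, and the choice of the median for the numeric column is just one of the options listed above. The predictive-model approach is not shown.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in each column.
data = pd.DataFrame({
    "sales": [100.0, np.nan, 300.0, 400.0],
    "region": ["east", "west", None, "west"],
})

# 1. Delete: drop every row that contains a missing value.
dropped = data.dropna()

# 2. Replace with a summary: median for the numeric column,
#    mode (most frequent value) for the categorical column.
filled = data.copy()
filled["sales"] = filled["sales"].fillna(filled["sales"].median())
filled["region"] = filled["region"].fillna(filled["region"].mode()[0])

# 3. Random replace: draw from the observed values of the same column.
rng = np.random.default_rng(0)  # fixed seed only for reproducibility
random_filled = data.copy()
observed = random_filled["sales"].dropna().to_numpy()
mask = random_filled["sales"].isna()
random_filled.loc[mask, "sales"] = rng.choice(observed, size=int(mask.sum()))

print(dropped)
print(filled)
print(random_filled)
```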

# Handling Categorical Data
Most machine learning libraries are designed to work with numerical variables, so categorical variables in their original text form can't be used directly for model building. Let's look at some common methods of handling categorical data based on their number of levels. <br>
- **Create dummy variables:** A dummy variable is a Boolean variable that indicates the presence of a category with the value 1 and its absence with 0. You should create k-1 dummy variables, where k is the number of levels, because the remaining level is implied when all the others are 0. Scikit-learn provides the `OneHotEncoder` class to create dummy variables for a given categorical variable.
- **Convert to number:** Another simple method is to represent the text description of each level with a number by using the `LabelEncoder` class of Scikit-learn (scikit-learn documents it for target labels; `OrdinalEncoder` is its counterpart for feature columns). If the number of levels is high (for example zip code, state, etc.), you can apply business logic to combine levels into groups. For example, zip codes or states can be combined into regions; however, with this method there is a risk of losing critical information. Another method is to combine categories based on similar frequency (the new categories can be high, medium, and low).
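
Combining levels by frequency, as mentioned in the last point, could be sketched like this. The zip codes and the cut-off counts for "high", "medium", and "low" are hypothetical; in practice the thresholds would come from the data or from business rules.

```python
import pandas as pd

# Hypothetical zip codes, one per customer record.
zips = pd.Series(["10001"] * 5 + ["94105"] * 3 + ["60601"] + ["73301"])

# How often each zip code occurs.
freq = zips.value_counts()

def frequency_band(count):
    """Map an occurrence count to a coarse band (thresholds are illustrative)."""
    if count >= 5:
        return "high"
    if count >= 3:
        return "medium"
    return "low"

# Band per zip code, then band per record.
band_by_zip = freq.map(frequency_band)
zip_band = zips.map(band_by_zip)

print(zip_band.value_counts())
```

The new `zip_band` variable has three levels instead of four, at the cost of no longer distinguishing the two rare zip codes.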

## Creating dummy variables

```python
import pandas as pd

# Small example frame: one categorical column (A) and one numeric column (B)
df = pd.DataFrame({'A': ['high', 'medium', 'low'],
                   'B': [10, 20, 30]},
                  index=[0, 1, 2])
print(df)
```

```
        A   B
0    high  10
1  medium  20
2     low  30
```


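```python
# Using the get_dummies function of the pandas package.
# dtype=int gives 1/0 columns; recent pandas versions default to True/False.
df_with_dummies = pd.get_dummies(df, prefix='A', columns=['A'], dtype=int)
print(df_with_dummies)
```

```
    B  A_high  A_low  A_medium
0  10       1      0         0
1  20       0      0         1
2  30       0      1         0
```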
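
The output above has k = 3 dummy columns, whereas the guideline earlier in this section recommends k-1. A minimal sketch of the k-1 variant uses the `drop_first` option of `get_dummies`, which drops the first level (here `high`, the first in sorted order). The frame is rebuilt so that the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'A': ['high', 'medium', 'low'],
                   'B': [10, 20, 30]})

# drop_first=True keeps k-1 dummies; a row with all zeros means 'high'.
df_k_minus_1 = pd.get_dummies(df, prefix='A', columns=['A'],
                              drop_first=True, dtype=int)
print(df_k_minus_1)
```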


## Converting categorical variable to numerics

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Using the factorize function of the pandas package:
# codes follow the order of first appearance (high=0, medium=1, low=2)
df['A_pd_factorized'] = pd.factorize(df['A'])[0]

# Alternatively, use the LabelEncoder class of the sklearn package:
# codes follow the sorted order of the labels (high=0, low=1, medium=2)
le = LabelEncoder()
df['A_LabelEncoded'] = le.fit_transform(df['A'])
print(df)
```

```
        A   B  A_pd_factorized  A_LabelEncoded
0    high  10                0               0
1  medium  20                1               2
2     low  30                2               1
```


# Normalizing Data
The unit or scale of measurement varies between variables, so an analysis of the raw measurements can be artificially skewed toward the variables with larger absolute values. Bringing all variables to the same order of magnitude keeps large-valued variables from dominating the findings and hurting the accuracy of the conclusions. Two broadly used methods for rescaling data are **normalization** and **standardization.**
<br><br>
- **Normalizing data** can be achieved by min-max scaling, which rescales all numeric values to the range 0 to 1. The formula is given below.
![Normalization formula][normalization]

[normalization]: https://github.com/sara-kassani/Python/blob/master/data/Normalization.JPG?raw=true "Normalization"

## **Note**   
Ensure you remove extreme outliers before applying the above technique as it
can skew the normal values in your data to a small interval.
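
To see why the note matters, the sketch below applies the usual min-max formula, `(x - min) / (max - min)`, to a made-up set of values with and without one extreme outlier. The helper name `min_max_scale` is only illustrative.

```python
import numpy as np

def min_max_scale(x):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical measurements, then the same data with one extreme value.
values = np.array([10, 20, 30, 40, 50])
with_outlier = np.append(values, 1000)

scaled = min_max_scale(values)
scaled_with_outlier = min_max_scale(with_outlier)

print(scaled)               # evenly spread over 0 to 1
print(scaled_with_outlier)  # the five normal values fall below 0.05
```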

The standardization technique will transform the variables to have a zero mean and standard deviation of one. The formula for standardization is given below and the outcome is commonly known as z-scores.

![Standardization formula][standardization]

[standardization]: https://github.com/sara-kassani/Python/blob/master/data/Standardization.JPG?raw=true "Standardization"

Where μ is the mean and σ is the standard deviation. Standardization is often the preferred method for many analyses, as it tells us where each data point lies within its distribution and gives a rough indication of outliers.
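
The z-score can also be computed by hand as `(x - μ) / σ`. The sketch below does so with NumPy on made-up values, using the population standard deviation (NumPy's default, which is also what scikit-learn's `StandardScaler` uses).

```python
import numpy as np

# Hypothetical measurements with mean 5 and population standard deviation 2.
x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mu = x.mean()     # mean
sigma = x.std()   # population standard deviation (ddof=0)
z = (x - mu) / sigma

print(mu, sigma)  # 5.0 2.0
print(z)          # z-scores: distance from the mean in standard deviations
print(z.mean(), z.std())
```

The value 9.0 gets the largest z-score (2.0), meaning it lies two standard deviations above the mean.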

```python
# Normalization and scaling
from sklearn import datasets
import numpy as np
from sklearn import preprocessing

iris = datasets.load_iris()
X = iris.data[:, [2, 3]]  # petal length and petal width
y = iris.target

# Standardization: zero mean and unit standard deviation per feature
std_scale = preprocessing.StandardScaler().fit(X)
X_std = std_scale.transform(X)

# Min-max scaling: rescale each feature to the range 0 to 1
minmax_scale = preprocessing.MinMaxScaler().fit(X)
X_minmax = minmax_scale.transform(X)

print('Mean before standardization: petal length={:.1f}, petal width={:.1f}'.format(X[:, 0].mean(), X[:, 1].mean()))
print('SD before standardization: petal length={:.1f}, petal width={:.1f}'.format(X[:, 0].std(), X[:, 1].std()))
print('Mean after standardization: petal length={:.1f}, petal width={:.1f}'.format(X_std[:, 0].mean(), X_std[:, 1].mean()))
print('SD after standardization: petal length={:.1f}, petal width={:.1f}'.format(X_std[:, 0].std(), X_std[:, 1].std()))
print('Min value before min-max scaling: petal length={:.1f}, petal width={:.1f}'.format(X[:, 0].min(), X[:, 1].min()))
print('Max value before min-max scaling: petal length={:.1f}, petal width={:.1f}'.format(X[:, 0].max(), X[:, 1].max()))
print('Min value after min-max scaling: petal length={:.1f}, petal width={:.1f}'.format(X_minmax[:, 0].min(), X_minmax[:, 1].min()))
print('Max value after min-max scaling: petal length={:.1f}, petal width={:.1f}'.format(X_minmax[:, 0].max(), X_minmax[:, 1].max()))
```

```
Mean before standardization: petal length=3.8, petal width=1.2
SD before standardization: petal length=1.8, petal width=0.8
Mean after standardization: petal length=0.0, petal width=-0.0
SD after standardization: petal length=1.0, petal width=1.0
Min value before min-max scaling: petal length=1.0, petal width=0.1
Max value before min-max scaling: petal length=6.9, petal width=2.5
Min value after min-max scaling: petal length=0.0, petal width=0.0
Max value after min-max scaling: petal length=1.0, petal width=1.0
```
