> # **Name : Abhishek Subhash Swami**

> # **Roll No. 13**

---

# Experiment No. 8 : Exploratory Data Analysis II

## One Hot Encoding
>One hot encoding converts a categorical variable into binary vectors, one position per category. Each vector has a length equal to the number of categories in the variable, with a 1 in the position of the instance's category and a 0 in all other positions.

In [60]:
import pandas as pd

df=pd.DataFrame({
    'empid':[1,2,3,4,5,6,7,8,9,10],
    'remark':['nice','great','good','nice','great','good','nice','great','good','nice'],
    'type':['male','female', 'male','female','male','female','male','female','male','female']
    }
)
df2=df.copy()  # independent copy, so the later in-place edits to df2 do not also change df
df

Unnamed: 0,empid,remark,type
0,1,nice,male
1,2,great,female
2,3,good,male
3,4,nice,female
4,5,great,male
5,6,good,female
6,7,nice,male
7,8,great,female
8,9,good,male
9,10,nice,female


In [61]:
#OneHotEncoder
from category_encoders import OneHotEncoder

# Instantiate OneHotEncoder
onehot_encoded = OneHotEncoder(cols=['remark']).fit(df).transform(df)
onehot_encoded

Unnamed: 0,empid,remark_1,remark_2,remark_3,type
0,1,1,0,0,male
1,2,0,1,0,female
2,3,0,0,1,male
3,4,1,0,0,female
4,5,0,1,0,male
5,6,0,0,1,female
6,7,1,0,0,male
7,8,0,1,0,female
8,9,0,0,1,male
9,10,1,0,0,female
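
The column names `remark_1`, `remark_2`, `remark_3` in the output above appear to follow the order in which the categories first occur (`nice`, `great`, `good`). The following is a minimal sketch of the same idea without the library, assuming that first-appearance ordering. `one_hot` is a hypothetical helper written for illustration, not part of `category_encoders`.

```python
def one_hot(values):
    """Return (categories, rows), where each row is a 0/1 list containing a single 1."""
    # Categories in order of first appearance (an assumption mirroring the output above)
    categories = list(dict.fromkeys(values))
    # One position per category: 1 where the value matches, 0 elsewhere
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


remarks = ['nice', 'great', 'good', 'nice', 'great']
categories, rows = one_hot(remarks)
print(categories)
for remark, row in zip(remarks, rows):
    print(remark, row)
```

Each row contains exactly one 1, which is what distinguishes one hot encoding from the more compact encodings later in this experiment.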


## Dummy Encoding
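>Dummy encoding is similar to one hot encoding, but it drops one category per variable to avoid the "dummy variable trap" (perfect multicollinearity among the encoded columns). A variable with k categories is therefore represented by k-1 binary features, and the dropped category corresponds to the rows where all of those features are 0. Note that `pd.get_dummies` keeps every category unless `drop_first=True` is passed, so the output below still contains one column per category.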

In [62]:
#DummyEncoder

encoded_data=pd.get_dummies(df, columns=['remark','type'])
encoded_data

Unnamed: 0,empid,remark_good,remark_great,remark_nice,type_female,type_male
0,1,0,0,1,0,1
1,2,0,1,0,1,0
2,3,1,0,0,0,1
3,4,0,0,1,1,0
4,5,0,1,0,0,1
5,6,1,0,0,1,0
6,7,0,0,1,0,1
7,8,0,1,0,1,0
8,9,1,0,0,0,1
9,10,0,0,1,1,0
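
The call above keeps all three `remark` columns and both `type` columns. The following is a minimal sketch of the k-1 version described in this section, using the same data. `drop_first` and `dtype` are existing `pd.get_dummies` parameters; `dtype=int` is an addition here so that the columns hold 0/1 rather than booleans on recent pandas versions.

```python
import pandas as pd

df = pd.DataFrame({
    'empid': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'remark': ['nice', 'great', 'good', 'nice', 'great', 'good', 'nice', 'great', 'good', 'nice'],
    'type': ['male', 'female', 'male', 'female', 'male', 'female', 'male', 'female', 'male', 'female'],
})

# drop_first=True drops the first level (in sorted order) of each encoded column,
# so 'remark' yields 2 columns instead of 3 and 'type' yields 1 instead of 2.
dummy_encoded = pd.get_dummies(df, columns=['remark', 'type'], drop_first=True, dtype=int)
print(dummy_encoded.columns.tolist())
print(dummy_encoded.head(3))
```

Here the dropped levels are `good` and `female`, so a row with `remark_great` and `remark_nice` both 0 stands for `good`.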


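## Label Encoding
> Label encoding converts a categorical variable into a numerical one by assigning a unique integer to each category. scikit-learn's `LabelEncoder` numbers the categories in sorted (alphabetical) order, so the integers do not reflect any meaningful ranking; here `female` becomes 0 and `male` becomes 1. `LabelEncoder` is intended for target labels and is applied to a feature column here only for illustration.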

In [63]:
from sklearn.preprocessing import LabelEncoder
encoder=LabelEncoder()
df2['type']=encoder.fit_transform(df2['type'])
df2

Unnamed: 0,empid,remark,type
0,1,nice,1
1,2,great,0
2,3,good,1
3,4,nice,0
4,5,great,1
5,6,good,0
6,7,nice,1
7,8,great,0
8,9,good,1
9,10,nice,0
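
To see which integer belongs to which category, the fitted encoder can be inspected. The following is a small sketch on a shortened list of `type` values; the variable names are chosen for illustration, while `classes_` and `inverse_transform` are existing `LabelEncoder` members.

```python
from sklearn.preprocessing import LabelEncoder

types = ['male', 'female', 'male', 'female']

encoder = LabelEncoder()
codes = encoder.fit_transform(types)

# classes_ holds the categories in sorted order; the index of each class is its code
mapping = {label: index for index, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform recovers the original labels from the integer codes
restored = list(encoder.inverse_transform(codes))
print(restored)
```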


## Ordinal Encoding
>Ordinal encoding is similar to label encoding, but the integers follow an order that you specify. Each category is assigned its position in that order (starting at 0), and each instance is represented by the corresponding integer. With the order used below, `nice` becomes 0, `good` becomes 1, and `great` becomes 2.



In [64]:
from sklearn.preprocessing import OrdinalEncoder
# Define the order of the categories
category_order = ['nice', 'good', 'great']

# Create an OrdinalEncoder object with the specified order
encoder = OrdinalEncoder(categories=[category_order])

# Fit and transform the data
df2['remark'] = encoder.fit_transform(df2[['remark']])


In [65]:
df2.astype(int)

Unnamed: 0,empid,remark,type
0,1,0,1
1,2,2,0
2,3,1,1
3,4,0,0
4,5,2,1
5,6,1,0
6,7,0,1
7,8,2,0
8,9,1,1
9,10,0,0


## Target Encoding
>Target encoding converts categorical data into numerical data by replacing each category with a statistic of the target variable, typically the mean of the target over the rows in that category. `category_encoders.TargetEncoder` additionally smooths each category mean toward the overall target mean, which is why the values below differ slightly from the raw per-category means.



In [66]:
from category_encoders import TargetEncoder

df3=pd.DataFrame({'Animal':['Cat','Dog','Dog','Cat','Cat','Cat','Dog','Dog','Cat','Dog','Cat'],
                  'Colour':['White','Black','Brown','Gray','White','Black','Brown','Tan','White','Gray','White'],
                  'Age':[3,4,2,5,1,3,4,6,7,4,2]})


# Fit a TargetEncoder on 'Animal' with 'Age' as the target and transform in one step
encode=TargetEncoder(cols=['Animal']).fit_transform(df3['Animal'],df3['Age'])
encode

Unnamed: 0,Animal
0,3.682315
1,3.777025
2,3.777025
3,3.682315
4,3.682315
5,3.682315
6,3.777025
7,3.777025
8,3.682315
9,3.777025
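
For comparison, the unsmoothed per-category means can be computed directly with pandas. This is a sketch of plain mean encoding, not a reimplementation of `TargetEncoder`; the names `category_means`, `global_mean` and `plain_encoded` are illustrative.

```python
import pandas as pd

df3 = pd.DataFrame({
    'Animal': ['Cat', 'Dog', 'Dog', 'Cat', 'Cat', 'Cat', 'Dog', 'Dog', 'Cat', 'Dog', 'Cat'],
    'Age': [3, 4, 2, 5, 1, 3, 4, 6, 7, 4, 2],
})

# Raw mean of the target ('Age') within each category of 'Animal'
category_means = df3.groupby('Animal')['Age'].mean()
# Overall mean of the target, the value that smoothing pulls each category toward
global_mean = df3['Age'].mean()

# Plain (unsmoothed) target encoding: replace each category with its mean
plain_encoded = df3['Animal'].map(category_means)

print(category_means)
print("Overall mean:", global_mean)
```

The raw means are 3.5 for `Cat` and 4.0 for `Dog`, and the overall mean is about 3.73. The encoder's values above (about 3.68 and 3.78) lie between each raw mean and the overall mean, which is consistent with smoothing.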


## Frequency Encoding
>Frequency encoding converts categorical data into numerical data by replacing each category with how often it occurs in the dataset, either as a raw count or, as here with `normalize=True`, as a proportion of all rows. Categories that occur equally often receive the same value, so they can no longer be told apart after encoding.

In [67]:
# Calculate the frequency of each category
freq = df3['Colour'].value_counts(normalize=True)

# Map the frequency values to the categories
data_encoded = df3.copy()
data_encoded['Colour'] = data_encoded['Colour'].map(freq)
data_encoded

Unnamed: 0,Animal,Colour,Age
0,Cat,0.363636,3
1,Dog,0.181818,4
2,Dog,0.181818,2
3,Cat,0.181818,5
4,Cat,0.363636,1
5,Cat,0.181818,3
6,Dog,0.181818,4
7,Dog,0.090909,6
8,Cat,0.363636,7
9,Dog,0.181818,4
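
In the output above, `Black`, `Brown` and `Gray` all map to 0.181818. The following is a short sketch, using only the standard library and raw counts instead of proportions, that lists which categories share an encoded value; `shared` is an illustrative name.

```python
from collections import Counter, defaultdict

colours = ['White', 'Black', 'Brown', 'Gray', 'White', 'Black',
           'Brown', 'Tan', 'White', 'Gray', 'White']

# Count-based frequency encoding: each category is replaced by its number of occurrences
counts = Counter(colours)
encoded = [counts[colour] for colour in colours]

# Group categories by their encoded value to expose collisions
shared = defaultdict(list)
for colour, count in counts.items():
    shared[count].append(colour)

print(dict(shared))
```

Whether such collisions matter depends on the model; if the categories must stay distinguishable, another encoding from this experiment may be a better fit.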


## Binary Encoding
>Binary encoding converts categorical data into numerical data by first assigning each category an integer and then writing that integer in binary, with one column per bit. A variable with k categories therefore needs only about log2(k) columns instead of k. For example, three categories can be encoded as 01, 10, and 11 using two columns, whereas one hot encoding would need three columns (001, 010, and 100).

In [68]:
from category_encoders import BinaryEncoder

encode=BinaryEncoder(cols=['Animal']).fit(df3).transform(df3)
encode

Unnamed: 0,Animal_0,Animal_1,Colour,Age
0,0,1,White,3
1,1,0,Black,4
2,1,0,Brown,2
3,0,1,Gray,5
4,0,1,White,1
5,0,1,Black,3
6,1,0,Brown,4
7,1,0,Tan,6
8,0,1,White,7
9,1,0,Gray,4
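
The two steps (integer code, then binary digits) can be sketched by hand. This is an illustration under stated assumptions, not the library's implementation: codes start at 1 and follow the order of first appearance, which reproduces the `Cat` = 0,1 and `Dog` = 1,0 pattern seen in the output above.

```python
animals = ['Cat', 'Dog', 'Dog', 'Cat']

# Step 1: integer codes starting at 1, in order of first appearance (assumed)
codes = {}
for animal in animals:
    codes.setdefault(animal, len(codes) + 1)

# Step 2: number of bits needed to write the largest code
n_bits = max(codes.values()).bit_length()

# Step 3: write each code in binary, one column per bit (most significant bit first)
encoded = [[int(bit) for bit in format(codes[animal], f'0{n_bits}b')]
           for animal in animals]

for animal, row in zip(animals, encoded):
    print(animal, row)
```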


## Feature Hashing
>Feature hashing converts categorical data into numerical data by passing each category through a hash function that maps it to one of a fixed number of columns (8 in the output below), so the width of the result does not depend on the number of categories. Distinct categories can collide in the same column. Each instance is represented by the sum of the hashed vectors of its categorical features.

In [69]:
from category_encoders import HashingEncoder

encode=HashingEncoder(cols=['Animal']).fit(df3).transform(df3)
encode

Unnamed: 0,col_0,col_1,col_2,col_3,col_4,col_5,col_6,col_7,Colour,Age
0,0,0,0,0,0,1,0,0,White,3
1,0,0,0,0,0,0,1,0,Black,4
2,0,0,0,0,0,0,1,0,Brown,2
3,0,0,0,0,0,1,0,0,Gray,5
4,0,0,0,0,0,1,0,0,White,1
5,0,0,0,0,0,1,0,0,Black,3
6,0,0,0,0,0,0,1,0,Brown,4
7,0,0,0,0,0,0,1,0,Tan,6
8,0,0,0,0,0,1,0,0,White,7
9,0,0,0,0,0,0,1,0,Gray,4
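
The idea can be sketched with the standard library. The choice of MD5 and the helper names `hash_bucket` and `hash_encode` are assumptions made for illustration, and the column indices produced here are not guaranteed to match those of `HashingEncoder`.

```python
import hashlib


def hash_bucket(value, n_components=8):
    """Map a category string to a column index in [0, n_components)."""
    digest = hashlib.md5(value.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_components


def hash_encode(values, n_components=8):
    """Return one fixed-length row per value, with a 1 in the hashed column."""
    rows = []
    for value in values:
        row = [0] * n_components
        row[hash_bucket(value, n_components)] += 1
        rows.append(row)
    return rows


animals = ['Cat', 'Dog', 'Dog', 'Cat']
for animal, row in zip(animals, hash_encode(animals)):
    print(animal, row)
```

Because the hash is deterministic, the same category always lands in the same column, and an unseen category can be encoded without refitting.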


## Pearson's Coefficient
>Pearson's correlation coefficient measures the strength and direction of the linear relationship between two variables. It ranges from -1 to 1, where -1 indicates a perfect negative linear correlation, 0 indicates no linear correlation, and 1 indicates a perfect positive linear correlation. The formula for Pearson's correlation coefficient is:

>```r = (nΣxy - ΣxΣy) / sqrt((nΣx^2 - (Σx)^2) * (nΣy^2 - (Σy)^2))```

Where:

* r is Pearson's correlation coefficient
* n is the number of observations
* Σxy is the sum of the products of each pair of corresponding values of x and y
* Σx and Σy are the sums of the values of x and y, respectively
* Σx^2 and Σy^2 are the sums of the squares of the values of x and y, respectively


In [70]:
hwdata=pd.DataFrame({
    'Height':[140,152,162,178,163,190,173,167,157,180],
    'Weight':[50,58,59,64,54,79,72,75,53,69]
})
hwdata

Unnamed: 0,Height,Weight
0,140,50
1,152,58
2,162,59
3,178,64
4,163,54
5,190,79
6,173,72
7,167,75
8,157,53
9,180,69


In [77]:
#find correlation coefficient

r=hwdata['Height'].corr(hwdata['Weight'])
print("Pearson's correlation coefficient:", r)

# The thresholds 0.5 and 0.1 are applied symmetrically on both sides of zero
if r>=0.5:
  print("Highly positive Correlation")
elif r>0.1:
  print("Low positive Correlation")
elif r>=-0.1:
  print("No Correlation")
elif r>-0.5:
  print("Low Negative Correlation")
else:
  print("Highly Negative Correlation")

Pearson's correlation coefficient: 0.82452239019378
Highly positive Correlation
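
As a check, the formula given earlier in this section can be evaluated term by term on the same height and weight data. This sketch uses only the standard library; the variable names mirror the symbols in the formula.

```python
import math

height = [140, 152, 162, 178, 163, 190, 173, 167, 157, 180]
weight = [50, 58, 59, 64, 54, 79, 72, 75, 53, 69]

n = len(height)                                    # number of observations
sum_x = sum(height)                                # Σx
sum_y = sum(weight)                                # Σy
sum_xy = sum(x * y for x, y in zip(height, weight))  # Σxy
sum_x2 = sum(x * x for x in height)                # Σx^2
sum_y2 = sum(y * y for y in weight)                # Σy^2

numerator = n * sum_xy - sum_x * sum_y
denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print("Pearson's correlation coefficient:", round(r, 6))
```

The result agrees with the value returned by `corr` above (about 0.824522).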


### Correlation Matrix

In [80]:
corr_matrix=hwdata.corr()
print(corr_matrix)

          Height    Weight
Height  1.000000  0.824522
Weight  0.824522  1.000000
