## Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

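Both techniques map categories to integers; the difference is whether the integers carry meaning. Label Encoding assigns an arbitrary unique integer to each category, while Ordinal Encoding assigns integers that follow a known order of the categories.

If the categories have no inherent order, such as colors or types of fruit, Label Encoding is adequate (for input features, one-hot encoding is often safer, because many models treat integer codes as ordered). If the categories have an inherent order, such as low, medium, and high or small, medium, and large, Ordinal Encoding is appropriate.

For example, suppose a categorical variable "Size" has three categories: Small, Medium, and Large. Label Encoding may assign any distinct integers; scikit-learn's `LabelEncoder`, for instance, sorts the labels alphabetically, giving Large=0, Medium=1, and Small=2. Ordinal Encoding with the order specified assigns Small=0, Medium=1, and Large=2, so the codes preserve the ranking.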
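The contrast can be sketched with scikit-learn. This is a minimal illustration on a made-up `sizes` array (the array and variable names are assumptions, not part of the exercise data):

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample values, chosen only for illustration.
sizes = np.array(["Small", "Large", "Medium", "Small"])

# LabelEncoder sorts the distinct labels alphabetically:
# Large=0, Medium=1, Small=2.
label_codes = LabelEncoder().fit_transform(sizes)

# OrdinalEncoder uses the order we pass in: Small=0, Medium=1, Large=2.
ordinal_encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
ordinal_codes = ordinal_encoder.fit_transform(sizes.reshape(-1, 1)).ravel()

print(label_codes.tolist())    # [2, 0, 1, 2]
print(ordinal_codes.tolist())  # [0.0, 2.0, 1.0, 0.0]
```

Only the second set of codes reflects the Small < Medium < Large ranking.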

## Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

Target Guided Ordinal Encoding is a feature engineering technique that encodes a categorical variable as a new ordinal variable based on its relationship with the target variable.

The basic idea is to calculate the mean of the target variable for each category, sort the categories by that mean, and replace each category label with its rank in that ordering. (Replacing the label with the mean itself is the closely related mean, or target, encoding.)

For example, suppose we have a dataset of customer transactions and want to predict which customers are likely to churn. One of the features is the "product_category" variable, which has 100 different categories. One-hot encoding would result in 100 new columns, which could cause issues with the dimensionality of the data. Label encoding would assign arbitrary numerical values to each category, which would not capture any meaningful relationship between the categories and the target variable.

Target Guided Ordinal Encoding can instead encode "product_category" by ranking the categories by their mean churn rate and replacing each label with its rank. The encoding then captures not only the categorical information but also the relationship between the product category and the likelihood of churn, in a single column. Because the encoding uses the target, the category means should be computed on the training data only to avoid leaking information into validation or test data. The encoded variable can then be used in a machine learning model to predict customer churn.
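The steps above can be sketched with pandas. The toy data, column names, and three categories below are hypothetical stand-ins for the 100-category churn example, and in practice the means would come from the training split only:

```python
import pandas as pd

# Hypothetical toy data: names and values are made up for illustration.
df = pd.DataFrame({
    "product_category": ["books", "toys", "books", "games",
                         "toys", "games", "games", "books"],
    "churn":            [0, 1, 0, 1, 0, 1, 1, 1],
})

# Step 1: mean of the target for each category.
category_means = df.groupby("product_category")["churn"].mean()

# Step 2: order the categories from lowest to highest mean churn.
ordered_categories = category_means.sort_values().index

# Step 3: replace each category with its rank in that ordering.
rank_map = {category: rank for rank, category in enumerate(ordered_categories)}
df["product_category_encoded"] = df["product_category"].map(rank_map)

print(rank_map)  # {'books': 0, 'toys': 1, 'games': 2}
```

Here "books" has the lowest mean churn (1/3), "toys" is in the middle (1/2), and "games" the highest (1.0), so they receive ranks 0, 1, and 2.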

## Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

Covariance measures how two variables vary together. A positive covariance indicates that the variables tend to increase or decrease together, while a negative covariance indicates that they tend to move in opposite directions. For a sample, it is calculated as the sum of the products of each pair's deviations from the two means, divided by n - 1.

Covariance is important in statistical analysis because it gives the direction of the linear relationship between two variables. Its magnitude depends on the units of the variables, so strength is usually judged with the correlation coefficient, which is the covariance divided by the product of the two standard deviations. Covariance is often used in exploratory data analysis to identify patterns in the data, and it is a key component of many statistical techniques, such as regression analysis, which uses it to estimate the parameters of the model.
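The calculation can be written as the sample covariance formula, which averages the products of paired deviations from the means:

$$\operatorname{cov}(X, Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

A minimal pure-Python sketch of this formula follows, using the small Temperature and Humidity sample from Q7. The function name `sample_covariance` is only illustrative:

```python
temperature = [20, 25, 22, 23, 24]
humidity = [30, 40, 35, 45, 50]

def sample_covariance(xs, ys):
    """Sample covariance: sum of paired deviation products over n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sum(products) / (n - 1)

cov_temperature_humidity = sample_covariance(temperature, humidity)
print(round(cov_temperature_humidity, 2))  # 11.25
```

The result agrees with the Temperature/Humidity entry that pandas reports in Q7, since `DataFrame.cov` also divides by n - 1 by default.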

## Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

In [5]:
import pandas as pd

df = pd.DataFrame({
    'color': ['red', 'green', 'blue'],
    'size': ['small', 'medium', 'large'],
    'material': ['wood', 'metal', 'plastic']
})
df

| | color | size | material |
|---|---|---|---|
| 0 | red | small | wood |
| 1 | green | medium | metal |
| 2 | blue | large | plastic |


In [10]:
from sklearn.preprocessing import LabelEncoder

# The same encoder is refit on each column. This works, but afterwards
# encoder.classes_ only holds the mapping of the last column (material).
encoder = LabelEncoder()
df['color_encoded'] = encoder.fit_transform(df['color'])
df['size_encoded'] = encoder.fit_transform(df['size'])
df['material_encoded'] = encoder.fit_transform(df['material'])
df

| | color | size | material | color_encoded | size_encoded | material_encoded |
|---|---|---|---|---|---|---|
| 0 | red | small | wood | 2 | 2 | 2 |
| 1 | green | medium | metal | 1 | 1 | 0 |
| 2 | blue | large | plastic | 0 | 0 | 1 |


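`LabelEncoder` is a class rather than a function. Each call to `fit_transform` sorts the distinct labels of one column alphabetically and replaces them with integers starting at 0, so for `color` blue=0, green=1, red=2, and for `size` large=0, medium=1, small=2. The integers are only identifiers: the `size` codes run in the reverse of the natural small < medium < large order, and the colors have no order at all. The numeric columns can be used by machine learning models that require numeric input, but models that interpret magnitude (such as linear models) may read a ranking into them. scikit-learn documents `LabelEncoder` as intended for target labels and recommends `OrdinalEncoder` or `OneHotEncoder` for input features.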
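To make the mappings explicit, one option is to keep a separate encoder per column and read its `classes_` attribute. The sketch below is one possible way to do this, not the only one; the `encoders` and `mappings` names are illustrative:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "color": ["red", "green", "blue"],
    "size": ["small", "medium", "large"],
    "material": ["wood", "metal", "plastic"],
})

encoders = {}
for column in ["color", "size", "material"]:
    column_encoder = LabelEncoder()
    df[column + "_encoded"] = column_encoder.fit_transform(df[column])
    encoders[column] = column_encoder  # keep it so the mapping is not lost

# classes_ lists the labels in the order of their integer codes.
mappings = {
    column: {str(label): code for code, label in enumerate(enc.classes_)}
    for column, enc in encoders.items()
}
print(mappings["color"])     # {'blue': 0, 'green': 1, 'red': 2}
print(mappings["material"])  # {'metal': 0, 'plastic': 1, 'wood': 2}
```

Keeping the fitted encoders also makes it possible to call `inverse_transform` later to recover the original labels.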

## Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

In [39]:
df = pd.read_csv("adult-income-dataset/adult.csv")

In [40]:
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'educational-num',
       'marital-status', 'occupation', 'relationship', 'race', 'gender',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'income'],
      dtype='object')

In [41]:
# .copy() avoids a SettingWithCopyWarning when new columns are added below
df = df[['age','education','income']].copy()

In [42]:
df['education'].unique()

array(['11th', 'HS-grad', 'Assoc-acdm', 'Some-college', '10th',
       'Prof-school', '7th-8th', 'Bachelors', 'Masters', 'Doctorate',
       '5th-6th', 'Assoc-voc', '9th', '12th', '1st-4th', 'Preschool'],
      dtype=object)

In [43]:
df.describe()

| | age |
|---|---|
| count | 48842.0 |
| mean | 38.643585 |
| std | 13.71051 |
| min | 17.0 |
| 25% | 28.0 |
| 50% | 37.0 |
| 75% | 48.0 |
| max | 90.0 |


In [51]:
# '<=50K' and '>50K' are income brackets, not amounts. Map them to the
# placeholder values 40000 and 60000 so income can enter the covariance matrix.
def income_modifier(data):
    return 40000 if data.strip().startswith('<=') else 60000
df['income_mod'] = df['income'].apply(income_modifier)

In [52]:
df.head()

| | age | education | income | income_mod |
|---|---|---|---|---|
| 0 | 25 | 11th | <=50K | 40000 |
| 1 | 38 | HS-grad | <=50K | 40000 |
| 2 | 28 | Assoc-acdm | >50K | 60000 |
| 3 | 44 | Some-college | >50K | 60000 |
| 4 | 18 | Some-college | <=50K | 40000 |


### Ordinal encoding for the `education` column

In [53]:
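from sklearn.preprocessing import OrdinalEncoder

# Categories are listed from lowest to highest rank, as used for the results below.
# NOTE: 'Assoc-voc' comes first, so it is ranked below 'Preschool'. That is
# probably unintended; changing the order would change the encoded values
# and the covariances that follow.
encoder = OrdinalEncoder(categories=[[
    'Assoc-voc', 'Preschool', '1st-4th', '5th-6th', '7th-8th', '9th',
    '10th', '11th', '12th', 'HS-grad', 'Some-college', 'Assoc-acdm',
    'Prof-school', 'Bachelors', 'Masters', 'Doctorate',
]])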

In [54]:
# fit_transform returns an (n, 1) array; ravel() flattens it into a single column
df['education_encoded'] = encoder.fit_transform(df[['education']]).ravel()

In [55]:
df.head()

| | age | education | income | income_mod | education_encoded |
|---|---|---|---|---|---|
| 0 | 25 | 11th | <=50K | 40000 | 7.0 |
| 1 | 38 | HS-grad | <=50K | 40000 | 9.0 |
| 2 | 28 | Assoc-acdm | >50K | 60000 | 11.0 |
| 3 | 44 | Some-college | >50K | 60000 | 10.0 |
| 4 | 18 | Some-college | <=50K | 40000 | 10.0 |


In [57]:
# numeric_only=True skips the string columns (required in pandas 2.x)
df.cov(numeric_only=True)


| | age | income_mod | education_encoded |
|---|---|---|---|
| age | 187.978083 | 26951.3 | 0.674024 |
| income_mod | 26951.297876 | 72811890.0 | 6523.634942 |
| education_encoded | 0.674024 | 6523.635 | 9.970335 |


In [58]:
# numeric_only=True skips the string columns (required in pandas 2.x)
df.corr(method='spearman', numeric_only=True)


| | age | income_mod | education_encoded |
|---|---|---|---|
| age | 1.0 | 0.269433 | 0.048031 |
| income_mod | 0.269433 | 1.0 | 0.293919 |
| education_encoded | 0.048031 | 0.293919 | 1.0 |


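All three covariances are positive, so each pair of variables tends to move in the same direction. Their magnitudes (for example, 26951.3 for age and income versus 0.67 for age and education) mostly reflect the units of the variables rather than the strength of the relationship, so the Spearman correlations are used to judge strength. Note that `income_mod` takes only the two placeholder values 40000 and 60000, so its covariances describe membership in the higher income bracket rather than actual income, and the education results depend on the category order passed to `OrdinalEncoder`.

There is a weak positive correlation between age and income, with a correlation coefficient of 0.269. This suggests that as age increases, income tends to increase slightly as well.

There is a weak positive correlation between income and education, with a correlation coefficient of 0.294. This suggests that higher education levels tend to go with the higher income bracket, but the relationship is not strong.

There is a very weak positive correlation between age and education, with a correlation coefficient of 0.048. This suggests that there is no meaningful monotonic relationship between age and education level.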
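Because covariance depends on units, one way to make the covariance matrix itself comparable is to divide each covariance by the product of the two standard deviations, which gives the Pearson correlation. The sketch below does this with values copied from the `df.cov()` output above; the dictionary and function names are only illustrative:

```python
import math

# Variances (diagonal) and covariances (off-diagonal) copied from df.cov() above.
variances = {
    "age": 187.978083,
    "income_mod": 72811890.0,
    "education_encoded": 9.970335,
}
covariances = {
    ("age", "income_mod"): 26951.297876,
    ("income_mod", "education_encoded"): 6523.634942,
    ("age", "education_encoded"): 0.674024,
}

def pearson_from_cov(cov_xy, var_x, var_y):
    """Pearson correlation = cov(X, Y) / (std(X) * std(Y))."""
    return cov_xy / math.sqrt(var_x * var_y)

pearson = {
    pair: pearson_from_cov(cov_xy, variances[pair[0]], variances[pair[1]])
    for pair, cov_xy in covariances.items()
}
for pair, r in pearson.items():
    print(pair, round(r, 3))
```

The resulting Pearson values are roughly 0.23, 0.24, and 0.02. They differ somewhat from the Spearman values because Spearman works on ranks, but they lead to the same reading: two weak positive relationships and one that is close to zero.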

## Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

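- Gender: the values cannot be ranked, so I would use one-hot encoding (with only two categories, a single 0/1 column is enough).
- Education Level: the values have a natural order (High School < Bachelor's < Master's < PhD), so I would use ordinal encoding.
- Employment Status: ordinal encoding is an appropriate choice if the values are treated as ranked (Unemployed < Part-Time < Full-Time). If that ranking is not meaningful for the task, one-hot encoding is the safer alternative.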
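These choices can be combined in one step with scikit-learn's `ColumnTransformer`. The sample rows below are hypothetical, and the category orders are the ones assumed in the answer above:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical sample rows, made up for illustration.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Education Level": ["High School", "PhD", "Bachelor's", "Master's"],
    "Employment Status": ["Unemployed", "Full-Time", "Part-Time", "Full-Time"],
})

transformer = ColumnTransformer(
    [
        # drop="if_binary" keeps one 0/1 column for a two-category feature.
        ("gender", OneHotEncoder(drop="if_binary"), ["Gender"]),
        # One explicit category order per ordinal column.
        ("ordered", OrdinalEncoder(categories=[
            ["High School", "Bachelor's", "Master's", "PhD"],
            ["Unemployed", "Part-Time", "Full-Time"],
        ]), ["Education Level", "Employment Status"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

encoded = transformer.fit_transform(df)
print(encoded.tolist())
# [[1.0, 0.0, 0.0], [0.0, 3.0, 2.0], [0.0, 1.0, 1.0], [1.0, 2.0, 2.0]]
```

The first column is 1 for Male and 0 for Female; the second and third columns hold the ranks for education and employment status.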

## Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

In [1]:
import pandas as pd

# create a sample dataset with two continuous variables and two categorical variables
data = {'Temperature': [20, 25, 22, 23, 24], 'Humidity': [30, 40, 35, 45, 50], 
        'Weather': ['sunny','cloudy','sunny','cloudy','rainy'], 
        'Wind': ['north', 'south', 'east', 'north', 'south']}
df = pd.DataFrame(data)
df

| | Temperature | Humidity | Weather | Wind |
|---|---|---|---|---|
| 0 | 20 | 30 | sunny | north |
| 1 | 25 | 40 | cloudy | south |
| 2 | 22 | 35 | sunny | east |
| 3 | 23 | 45 | cloudy | north |
| 4 | 24 | 50 | rainy | south |


In [5]:
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder()
E = encoder.fit_transform(df[['Weather','Wind']]).toarray()
df_encoded = pd.DataFrame(E, columns=encoder.get_feature_names_out())
df_encoded

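| | Weather_cloudy | Weather_rainy | Weather_sunny | Wind_east | Wind_north | Wind_south |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 3 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |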


In [8]:
df_concat = pd.concat([df, df_encoded], axis = 1)
df_concat

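| | Temperature | Humidity | Weather | Wind | Weather_cloudy | Weather_rainy | Weather_sunny | Wind_east | Wind_north | Wind_south |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 | 30 | sunny | north | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 |
| 1 | 25 | 40 | cloudy | south | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 2 | 22 | 35 | sunny | east | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 3 | 23 | 45 | cloudy | north | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 24 | 50 | rainy | south | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |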


In [9]:
# numeric_only=True skips the original string columns (required in pandas 2.x)
df_concat.cov(numeric_only=True)


| | Temperature | Humidity | Weather_cloudy | Weather_rainy | Weather_sunny | Wind_east | Wind_north | Wind_south |
|---|---|---|---|---|---|---|---|---|
| Temperature | 3.7 | 11.25 | 0.6 | 0.3 | -0.9 | -0.2 | -0.65 | 0.85 |
| Humidity | 11.25 | 62.5 | 1.25 | 2.5 | -3.75 | -1.25 | -1.25 | 2.5 |
| Weather_cloudy | 0.6 | 1.25 | 0.3 | -0.1 | -0.2 | -0.1 | 0.05 | 0.05 |
| Weather_rainy | 0.3 | 2.5 | -0.1 | 0.2 | -0.1 | -0.05 | -0.1 | 0.15 |
| Weather_sunny | -0.9 | -3.75 | -0.2 | -0.1 | 0.3 | 0.15 | 0.05 | -0.2 |
| Wind_east | -0.2 | -1.25 | -0.1 | -0.05 | 0.15 | 0.2 | -0.1 | -0.1 |
| Wind_north | -0.65 | -1.25 | 0.05 | -0.1 | 0.05 | -0.1 | 0.3 | -0.2 |
| Wind_south | 0.85 | 2.5 | 0.05 | 0.15 | -0.2 | -0.1 | -0.2 | 0.3 |


In [10]:
# numeric_only=True skips the original string columns (required in pandas 2.x)
df_concat.corr(method='spearman', numeric_only=True)


| | Temperature | Humidity | Weather_cloudy | Weather_rainy | Weather_sunny | Wind_east | Wind_north | Wind_south |
|---|---|---|---|---|---|---|---|---|
| Temperature | 1.0 | 0.7 | 0.57735 | 0.353553 | -0.866025 | -0.353553 | -0.57735 | 0.866025 |
| Humidity | 0.7 | 1.0 | 0.288675 | 0.707107 | -0.866025 | -0.353553 | -0.288675 | 0.57735 |
| Weather_cloudy | 0.57735 | 0.288675 | 1.0 | -0.408248 | -0.666667 | -0.408248 | 0.166667 | 0.166667 |
| Weather_rainy | 0.353553 | 0.707107 | -0.408248 | 1.0 | -0.408248 | -0.25 | -0.408248 | 0.612372 |
| Weather_sunny | -0.866025 | -0.866025 | -0.666667 | -0.408248 | 1.0 | 0.612372 | 0.166667 | -0.666667 |
| Wind_east | -0.353553 | -0.353553 | -0.408248 | -0.25 | 0.612372 | 1.0 | -0.408248 | -0.408248 |
| Wind_north | -0.57735 | -0.288675 | 0.166667 | -0.408248 | 0.166667 | -0.408248 | 1.0 | -0.666667 |
| Wind_south | 0.866025 | 0.57735 | 0.166667 | 0.612372 | -0.666667 | -0.408248 | -0.666667 | 1.0 |


The sign of each covariance matches the sign of the corresponding Spearman correlation, so the correlations are used below to describe strength. With only five rows, these values are illustrative rather than reliable estimates. One-hot columns derived from the same variable (for example `Weather_sunny` and `Weather_cloudy`) are negatively correlated by construction, because exactly one of them is 1 in each row.

There is a weak positive correlation (0.17) between

- cloudy weather and north wind
- cloudy weather and south wind
- sunny weather and north wind

There is a strong negative correlation between

- temperature and sunny weather (-0.87)
- humidity and sunny weather (-0.87)
- cloudy weather and sunny weather (-0.67)
- sunny weather and south wind (-0.67)
- north wind and south wind (-0.67)
- temperature and north wind (-0.58)

There is a moderate negative correlation between

- temperature and east wind (-0.35)
- humidity and east wind (-0.35)
- humidity and north wind (-0.29)
- cloudy weather and rainy weather (-0.41)
- cloudy weather and east wind (-0.41)
- rainy weather and sunny weather (-0.41)
- rainy weather and north wind (-0.41)
- rainy weather and east wind (-0.25)
- east wind and north wind (-0.41)
- east wind and south wind (-0.41)

There is a strong positive correlation between

- temperature and humidity (0.70)
- temperature and south wind (0.87)
- temperature and cloudy weather (0.58)
- humidity and rainy weather (0.71)
- humidity and south wind (0.58)
- rainy weather and south wind (0.61)
- sunny weather and east wind (0.61)

There is a moderate positive correlation between

- temperature and rainy weather (0.35)
- humidity and cloudy weather (0.29)
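A covariance between a continuous variable and a one-hot column is easier to read next to the group means: it is positive when the category's mean lies above the overall mean and negative when it lies below. The following sketch, which rebuilds the relevant part of the sample data, shows this for the weather categories:

```python
import pandas as pd

df = pd.DataFrame({
    "Temperature": [20, 25, 22, 23, 24],
    "Humidity": [30, 40, 35, 45, 50],
    "Weather": ["sunny", "cloudy", "sunny", "cloudy", "rainy"],
})

# Mean of each continuous variable within each weather category.
group_means = df.groupby("Weather")[["Temperature", "Humidity"]].mean()
overall_means = df[["Temperature", "Humidity"]].mean()

print(group_means)
print(overall_means)
```

Sunny rows have a mean temperature of 21.0, below the overall mean of 22.8, which matches the negative covariance between `Temperature` and `Weather_sunny`. Cloudy and rainy rows both average 24.0, which matches their positive covariances.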