## Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you might choose one over the other.

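Ordinal Encoding: Maps categories to integers in an order you specify, so the numbers reflect the ranking of the categories (e.g., "low" = 0, "medium" = 1, "high" = 2). Useful when order matters (e.g., customer satisfaction levels or shirt sizes).

Label Encoding: Assigns an arbitrary unique integer to each category without regard to order (e.g., "color": red = 0, blue = 1, green = 2). Suitable for nominal data with no inherent order (e.g., colors). In scikit-learn, `LabelEncoder` is intended for the target variable; when it is applied to a nominal feature, the integers imply an order that does not exist, which can mislead linear or distance-based models.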
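
To make the contrast concrete, here is a minimal sketch using scikit-learn's `OrdinalEncoder` and `LabelEncoder`. The sample values and variable names (`levels`, `colors`) are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical sample data for illustration only.
levels = np.array([['low'], ['high'], ['medium'], ['low']])
colors = ['red', 'blue', 'green', 'red']

# Ordinal encoding: the category order is passed explicitly,
# so low < medium < high becomes 0 < 1 < 2.
ordinal_encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
ordinal_codes = ordinal_encoder.fit_transform(levels).ravel()

# Label encoding: codes follow the sorted (alphabetical) order of the labels,
# which carries no meaning for colors.
label_encoder = LabelEncoder()
label_codes = label_encoder.fit_transform(colors)

print(ordinal_codes)
print(label_codes, list(label_encoder.classes_))
```

Note that `LabelEncoder` does not keep the order in which labels appear: it sorts them, so here "blue" receives the smallest code.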

## Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in a machine learning project.

Target Guided Ordinal Encoding:

1. Computes the target variable's average value for each category (e.g., average income for each job title).
2. Sorts the categories by that average and assigns ordinal numbers (ranks) in the sorted order.

Use case: When a categorical feature has no natural order but its categories are likely related to the target (e.g., predicting house prices from neighborhood categories). The averages should come from the training data only, so that target values from the test set do not leak into the feature.
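
The procedure can be sketched with pandas as follows. The tiny dataset and the column names `neighborhood` and `price` are hypothetical, and in a real project the mapping would be learned on the training split only.

```python
import pandas as pd

# Hypothetical training data: neighborhood label and house price (in thousands).
train = pd.DataFrame({
    'neighborhood': ['A', 'B', 'C', 'A', 'B', 'C'],
    'price': [300, 150, 220, 320, 130, 240],
})

# Step 1: mean of the target per category, sorted from lowest to highest.
mean_price = train.groupby('neighborhood')['price'].mean().sort_values()

# Step 2: rank the categories by that mean (lowest mean gets rank 0).
rank_map = {category: rank for rank, category in enumerate(mean_price.index)}

# Step 3: replace each category with its rank.
train['neighborhood_encoded'] = train['neighborhood'].map(rank_map)

print(rank_map)
print(train)
```

Categories that appear only at prediction time are not in `rank_map` and would map to `NaN`, so a fallback value is usually needed.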

## Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated?

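Covariance is a statistical measure of how two variables vary together. It's important because it helps us understand:

Direction of relationship: A positive covariance means the variables tend to increase or decrease together, while a negative covariance means one tends to increase as the other decreases.

Joint variability, but not strength on its own: The magnitude of covariance depends on the units of both variables, so a larger absolute value does not by itself indicate a stronger linear relationship. Dividing the covariance by the product of the two standard deviations gives the correlation coefficient, which is unit-free and can be compared across pairs of variables.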

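Calculation (for n paired observations of x and y):

1. Calculate the mean of each variable.
2. Subtract the mean from each data point for both variables.
3. Multiply the corresponding deviations from the mean.
4. Sum these products and divide by n - 1 for the sample covariance (the default in `np.cov`), or by n for the population covariance.

$$\operatorname{cov}(x, y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$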
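
The steps can be followed literally in plain Python. This is a minimal sketch: `sample_covariance` is a hypothetical helper name, and it assumes the sample version of the formula (dividing by n - 1).

```python
def sample_covariance(x, y):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    # Step 1: mean of each variable.
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Steps 2 and 3: deviations from the mean, multiplied pairwise.
    products = [(xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)]
    # Step 4: sum the products and divide by n - 1.
    return sum(products) / (n - 1)


ages = [25, 30, 35, 40, 45]
education_years = [12, 16, 14, 18, 20]
print(sample_covariance(ages, education_years))
```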

## Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium, large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library. Show your code and explain the output.

In [1]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

data = {'Color': ['red', 'green', 'blue', 'red', 'green'],
        'Size': ['small', 'medium', 'large', 'medium', 'small'],
        'Material': ['wood', 'metal', 'plastic', 'metal', 'wood']}

df = pd.DataFrame(data)

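# fit_transform re-fits the encoder on every call, so one instance can be reused here.
# Keep a separate LabelEncoder per column if you need inverse_transform later.
label_encoder = LabelEncoder()

df['Color_encoded'] = label_encoder.fit_transform(df['Color'])
df['Size_encoded'] = label_encoder.fit_transform(df['Size'])
df['Material_encoded'] = label_encoder.fit_transform(df['Material'])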

print("Original dataset:\n", df[['Color', 'Size', 'Material']])
print("\nEncoded dataset:\n", df[['Color_encoded', 'Size_encoded', 'Material_encoded']])

Original dataset:
    Color    Size Material
0    red   small     wood
1  green  medium    metal
2   blue   large  plastic
3    red  medium    metal
4  green   small     wood

Encoded dataset:
    Color_encoded  Size_encoded  Material_encoded
0              2             2                 2
1              1             1                 0
2              0             0                 1
3              2             1                 0
4              1             2                 2
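
Reading the output: `LabelEncoder` sorts the distinct values of each column alphabetically and numbers them from 0. That gives blue = 0, green = 1, red = 2 for Color; large = 0, medium = 1, small = 2 for Size; and metal = 0, plastic = 1, wood = 2 for Material. The codes are only identifiers. For Size in particular, the alphabetical codes do not follow the natural small < medium < large order. The sketch below is one way to inspect these mappings. It keeps one encoder per column, and the dictionaries `encoders` and `mappings` are helper names introduced here.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    'Color': ['red', 'green', 'blue', 'red', 'green'],
    'Size': ['small', 'medium', 'large', 'medium', 'small'],
    'Material': ['wood', 'metal', 'plastic', 'metal', 'wood'],
})

encoders = {}  # one fitted encoder per column
mappings = {}  # label -> integer code, per column
for column in ['Color', 'Size', 'Material']:
    encoder = LabelEncoder()
    df[column + '_encoded'] = encoder.fit_transform(df[column])
    encoders[column] = encoder
    # classes_ holds the sorted labels; a label's position is its code.
    mappings[column] = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mappings)

# With a dedicated encoder per column, the codes can be mapped back to labels.
original_colors = encoders['Color'].inverse_transform(df['Color_encoded'])
print(list(original_colors))
```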


## Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education level. Interpret the results.

In [2]:
import numpy as np
import pandas as pd

data = {'Age': [25, 30, 35, 40, 45],
        'Income': [50000, 60000, 75000, 80000, 90000],
        'EducationLevel': [12, 16, 14, 18, 20]}

df = pd.DataFrame(data)

covariance_matrix = np.cov(df, rowvar=False)

print("Covariance Matrix:\n", covariance_matrix)

Covariance Matrix:
 [[6.25e+01 1.25e+05 2.25e+01]
 [1.25e+05 2.55e+08 4.25e+04]
 [2.25e+01 4.25e+04 1.00e+01]]
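
Interpretation: the diagonal entries are the sample variances (62.5 for Age, 2.55e+08 for Income, 10 for EducationLevel), and every off-diagonal entry is positive, so in this small sample age, income, and education level tend to rise together. The magnitudes mostly reflect units: the Income entries are large because income is measured in currency units, not because its relationships are stronger. One way to compare strength is to rescale to correlations. That step is an addition to the original answer and is sketched below with pandas.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45],
    'Income': [50000, 60000, 75000, 80000, 90000],
    'EducationLevel': [12, 16, 14, 18, 20],
})

# Sample covariance (divides by n - 1), labelled by column name.
covariance = df.cov()

# Pearson correlation: covariance divided by the product of the standard deviations.
correlation = df.corr()

print(covariance)
print(correlation.round(3))
```

With only five rows, these numbers describe this sample and should not be read as general estimates.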


## Q6. You are working on a machine learning project with a dataset containing several categorical variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD), and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for each variable, and why?

1. Gender (Male/Female):

Method: Label encoding.

Justification: "Gender" here has only two categories with no inherent order, so a single 0/1 column captures it fully without implying a misleading ranking.

2. Education Level (High School/Bachelor's/Master's/PhD):

Method: Ordinal encoding.

Justification: While there are multiple categories, "Education Level" has a clear inherent order. Ordinal encoding preserves this order, potentially benefiting models that can leverage such information (e.g., decision trees).

3. Employment Status (Unemployed/Part-Time/Full-Time):

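Method: One-hot encoding.

Justification: If the three categories are treated as nominal, label encoding would impose an arbitrary numeric order (alphabetically, Full-Time = 0, Part-Time = 1, Unemployed = 2) that linear or distance-based models could misread. One-hot encoding gives each category its own 0/1 column and avoids that assumption. If the analysis instead treats the categories as increasing degrees of employment (Unemployed < Part-Time < Full-Time), ordinal encoding is a reasonable alternative.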
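
As a rough sketch of how these choices could look in pandas: the four sample rows are invented, and Employment Status is one-hot encoded here on the assumption that it is treated as nominal.

```python
import pandas as pd

# Hypothetical sample rows for illustration only.
df = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Education Level': ["Bachelor's", 'PhD', 'High School', "Master's"],
    'Employment Status': ['Full-Time', 'Unemployed', 'Part-Time', 'Full-Time'],
})

# Binary variable: a single 0/1 column is enough.
df['Gender_encoded'] = (df['Gender'] == 'Male').astype(int)

# Ordinal variable: codes follow the explicitly stated order.
education_order = ['High School', "Bachelor's", "Master's", 'PhD']
df['Education_encoded'] = pd.Categorical(
    df['Education Level'], categories=education_order, ordered=True
).codes

# Nominal variable (assumed): one 0/1 column per category.
employment_dummies = pd.get_dummies(
    df['Employment Status'], prefix='Employment', dtype=int
)

encoded = pd.concat(
    [df[['Gender_encoded', 'Education_encoded']], employment_dummies], axis=1
)
print(encoded)
```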

## Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/East/West). Calculate the covariance between each pair of variables and interpret the results.

In [4]:
import numpy as np
import pandas as pd

data = {'Temperature': [25, 28, 22, 20, 30],
        'Humidity': [60, 55, 70, 75, 50],
        'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Rainy'],
        'Wind Direction': ['North', 'South', 'East', 'West', 'North']}

df = pd.DataFrame(data)

continuous_variables = df[['Temperature', 'Humidity']]

covariance_matrix_continuous = np.cov(continuous_variables, rowvar=False)

print("Covariance Matrix for Continuous Variables:\n", covariance_matrix_continuous)

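# Covariance needs numeric inputs, so one-hot encode the categorical columns first.
# dtype=float avoids boolean dummy columns (the default in pandas 2.x).
df_encoded = pd.get_dummies(df, columns=['Weather Condition', 'Wind Direction'], dtype=float)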

all_variables = df_encoded[['Temperature', 'Humidity', 'Weather Condition_Sunny', 'Weather Condition_Cloudy', 'Weather Condition_Rainy', 'Wind Direction_North', 'Wind Direction_South', 'Wind Direction_East', 'Wind Direction_West']]

covariance_matrix_all = np.cov(all_variables, rowvar=False)

print("\nCovariance Matrix for All Variables:\n", covariance_matrix_all)

Covariance Matrix for Continuous Variables:
 [[ 17.  -42.5]
 [-42.5 107.5]]

Covariance Matrix for All Variables:
 [[ 1.700e+01 -4.250e+01 -1.250e+00  7.500e-01  5.000e-01  1.250e+00
   7.500e-01 -7.500e-01 -1.250e+00]
 [-4.250e+01  1.075e+02  2.750e+00 -1.750e+00 -1.000e+00 -3.500e+00
  -1.750e+00  2.000e+00  3.250e+00]
 [-1.250e+00  2.750e+00  3.000e-01 -1.000e-01 -2.000e-01  5.000e-02
  -1.000e-01 -1.000e-01  1.500e-01]
 [ 7.500e-01 -1.750e+00 -1.000e-01  2.000e-01 -1.000e-01 -1.000e-01
   2.000e-01 -5.000e-02 -5.000e-02]
 [ 5.000e-01 -1.000e+00 -2.000e-01 -1.000e-01  3.000e-01  5.000e-02
  -1.000e-01  1.500e-01 -1.000e-01]
 [ 1.250e+00 -3.500e+00  5.000e-02 -1.000e-01  5.000e-02  3.000e-01
  -1.000e-01 -1.000e-01 -1.000e-01]
 [ 7.500e-01 -1.750e+00 -1.000e-01  2.000e-01 -1.000e-01 -1.000e-01
   2.000e-01 -5.000e-02 -5.000e-02]
 [-7.500e-01  2.000e+00 -1.000e-01 -5.000e-02  1.500e-01 -1.000e-01
  -5.000e-02  2.000e-01 -5.000e-02]
 [-1.250e+00  3.250e+00  1.500e-01 -5.000e-02 -1.000e
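
Interpretation: the covariance between Temperature and Humidity is -42.5, so in this sample higher temperatures go with lower humidity. A covariance between a continuous variable and a 0/1 dummy column says whether that variable tends to be above or below its overall mean when the category is present. For example, the value -1.25 for Temperature and `Weather Condition_Sunny` means the sunny rows are cooler than average here. The sketch below recomputes these two values and, as an added cross-check that is not part of the original cells, compares group means. With five rows and one or two rows per category, none of this generalizes beyond the sample.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Temperature': [25, 28, 22, 20, 30],
    'Humidity': [60, 55, 70, 75, 50],
    'Weather Condition': ['Sunny', 'Cloudy', 'Rainy', 'Sunny', 'Rainy'],
})

# Covariance between the two continuous variables.
cov_temp_humidity = np.cov(df['Temperature'], df['Humidity'])[0, 1]

# Covariance between Temperature and the 0/1 indicator for "Sunny".
sunny = (df['Weather Condition'] == 'Sunny').astype(float)
cov_temp_sunny = np.cov(df['Temperature'], sunny)[0, 1]

# Cross-check: mean temperature per weather condition vs. the overall mean.
mean_temp_by_weather = df.groupby('Weather Condition')['Temperature'].mean()
overall_mean_temp = df['Temperature'].mean()

print(cov_temp_humidity, cov_temp_sunny)
print(mean_temp_by_weather, overall_mean_temp)
```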
