In [None]:
""" Q1. What is the difference between Ordinal Encoding and Label Encoding? Provide an example of when you
might choose one over the other. """

Ans:
    Ordinal encoding and label encoding are both techniques used to represent categorical variables numerically. However, there are key differences between them in terms of the nature of the variable and the information they capture.

Ordinal Encoding:
1. Ordinal encoding assigns an integer to each category according to its relative order or rank.

2. The assigned numbers carry ordinal information: a larger code means a higher position in the hierarchy, although the gaps between codes are not necessarily equal in meaning.

3. It is suitable for variables whose categories have a meaningful order or ranking.

4. Example: Suppose you have a variable representing education level with categories "High School," "Bachelor's Degree," "Master's Degree," and "Ph.D." You can assign the codes 1, 2, 3, and 4, respectively, in order of increasing education.

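Label Encoding:
1. Label encoding assigns a unique integer to each category without reference to any inherent order or hierarchy.

2. The assigned numbers are arbitrary (scikit-learn's LabelEncoder, for instance, numbers the categories in sorted order starting from 0) and do not imply any rank among the categories.

3. It is suitable for variables with no meaningful order, and in particular for encoding the target labels of a classifier. When it is applied to input features, models that treat the codes as magnitudes (such as linear models) may read an order that does not exist, so One-Hot Encoding is often preferred there.

4. Example: Consider a variable representing colors with categories "Red," "Blue," and "Green." You can assign the codes 1, 2, and 3 to these categories without implying any order or rank.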
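The two examples above can be sketched with scikit-learn as follows. This is a minimal illustration, not the only way to do it: the variable names are made up for the example, and scikit-learn numbers categories from 0 rather than from 1.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Ordinal encoding: the category order is supplied explicitly.
education_order = ["High School", "Bachelor's Degree", "Master's Degree", "Ph.D."]
people = pd.DataFrame({"Education": ["Ph.D.", "High School", "Master's Degree"]})
ordinal = OrdinalEncoder(categories=[education_order])
education_codes = ordinal.fit_transform(people[["Education"]]).ravel()
print(education_codes)  # positions in education_order, counted from 0

# Label encoding: classes are sorted alphabetically, so the codes carry no rank.
colors = ["Red", "Blue", "Green", "Blue"]
label = LabelEncoder()
color_codes = label.fit_transform(colors)
print(list(label.classes_), color_codes)
```

Choose the ordinal encoder when you can state the order yourself, as with education level; the label encoder only guarantees that each category gets a distinct integer.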

In [None]:
""" Q2. Explain how Target Guided Ordinal Encoding works and provide an example of when you might use it in
a machine learning project. """

Ans:
    
Target Guided Ordinal Encoding is a technique used to encode categorical variables by creating ordinal mappings based on the target variable. It leverages the relationship between the categorical variable and the target variable to assign numerical codes to each category, aiming to capture the target-related information in the encoding.

Here's how Target Guided Ordinal Encoding works:

1. Calculate the mean or median of the target variable for each category of the categorical variable.

2. Sort the categories by their mean or median target values.

3. Assign ordinal codes to the categories, starting from 1 for the category with the lowest mean or median target value and incrementing the code by 1 for each subsequent category.

Example of when to use Target Guided Ordinal Encoding:

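Suppose you are working on a churn prediction project for a telecommunications company, and one of the categorical variables in your dataset is "Subscription Type" with categories like "Prepaid," "Postpaid," and "Corporate." You want to encode this variable in a way that reflects the likelihood of churn. With Target Guided Ordinal Encoding, you would compute the churn rate for each subscription type, rank the types from lowest to highest churn rate, and assign the codes 1, 2, and 3 in that order.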
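The steps described above can be sketched with pandas. The data below is a hypothetical toy sample invented for illustration (it is not from any real telecom dataset), and the column names are assumptions.

```python
import pandas as pd

# Hypothetical toy data: churn = 1 means the customer left.
df = pd.DataFrame({
    "subscription_type": ["Prepaid", "Prepaid", "Prepaid", "Postpaid",
                          "Postpaid", "Corporate", "Corporate", "Corporate"],
    "churn": [1, 1, 0, 1, 0, 0, 0, 0],
})

# Step 1: mean of the target for each category.
churn_rate = df.groupby("subscription_type")["churn"].mean()

# Step 2: sort the categories by that mean, lowest first.
ordered = churn_rate.sort_values().index

# Step 3: assign the codes 1, 2, 3, ... in that order.
mapping = {category: code for code, category in enumerate(ordered, start=1)}
df["subscription_type_encoded"] = df["subscription_type"].map(mapping)
print(mapping)
```

In a real project the mapping should be computed on the training split only and then applied to the validation and test data, since it is derived from the target.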

In [None]:
""" Q3. Define covariance and explain why it is important in statistical analysis. How is covariance calculated? """

Ans:
    Covariance is a statistical measure that quantifies the relationship between two random variables. It indicates how changes in one variable are associated with changes in another variable. Specifically, covariance measures the extent to which two variables vary together.

Importance of Covariance in Statistical Analysis:

1. Relationship Assessment: Covariance helps in understanding the nature of the relationship between two variables. If the covariance is positive, it indicates a positive relationship, meaning that as one variable increases, the other tends to increase as well. A negative covariance suggests a negative relationship, indicating that as one variable increases, the other tends to decrease.

2. Direction and Magnitude: The sign of the covariance gives the direction of the relationship. Its magnitude, however, depends on the units and scale of the two variables, so a larger absolute covariance does not by itself indicate a stronger association. To compare strength across pairs of variables, the covariance is divided by the two standard deviations, which gives the correlation coefficient.

3. Multivariate Analysis: Covariance is crucial in multivariate analysis, where multiple variables are involved. The covariance matrix summarizes the pairwise linear associations among all the variables at once.

Covariance is calculated using the following formula:

        cov(X, Y) = Σ[(Xᵢ - μₓ)(Yᵢ - μᵧ)] / (n - 1)

where:
cov(X, Y) is the sample covariance between variables X and Y.
Xᵢ and Yᵢ are the individual data points of X and Y, respectively.
μₓ and μᵧ are the sample means of X and Y, respectively.
n is the number of data points. Dividing by n - 1 rather than n gives the unbiased sample estimate.
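As a quick check of the formula, the sketch below computes the covariance by hand and compares it with NumPy. The two arrays are hypothetical values chosen only for illustration.

```python
import numpy as np

# Hypothetical sample data.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

n = len(x)
# Sum of the products of deviations from the means, divided by n - 1.
cov_manual = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)

# np.cov returns the full 2x2 matrix; the off-diagonal entry is cov(X, Y).
cov_numpy = np.cov(x, y)[0, 1]
print(cov_manual, cov_numpy)
```

Both values agree because np.cov also divides by n - 1 by default.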

In [None]:
""" Q4. For a dataset with the following categorical variables: Color (red, green, blue), Size (small, medium,
large), and Material (wood, metal, plastic), perform label encoding using Python's scikit-learn library.
Show your code and explain the output. """

In [4]:
import pandas as pd

# One row per item, one column per categorical variable.
categorical_values = {
    'Color': ['red', 'green', 'blue'],
    'Size': ['small', 'medium', 'large'],
    'Material': ['wood', 'metal', 'plastic'],
}
df = pd.DataFrame(categorical_values)
df

,Color,Size,Material
0,red,small,wood
1,green,medium,metal
2,blue,large,plastic


In [12]:
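from sklearn.preprocessing import LabelEncoder

# Fit a separate LabelEncoder per column so that each fitted mapping is kept
# (reusing one encoder would overwrite its classes_ on every fit).
# LabelEncoder sorts the categories alphabetically and numbers them from 0.
encoders = {col: LabelEncoder() for col in df.columns}
numerical = {col: encoders[col].fit_transform(df[col]) for col in df.columns}
pd.DataFrame(numerical)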

,Color,Size,Material
0,2,2,2
1,1,1,0
2,0,0,1
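Explanation of the output: LabelEncoder sorts the categories of each column alphabetically and numbers them from 0. For Color this gives blue = 0, green = 1, red = 2, so the first row (red, small, wood) becomes (2, 2, 2). Note that for Size the alphabetical order (large, medium, small) is the reverse of the natural order, so these codes should not be read as a ranking. The sketch below recovers the mappings from the fitted encoders; the `mappings` dictionary is a name introduced here for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({
    "Color": ["red", "green", "blue"],
    "Size": ["small", "medium", "large"],
    "Material": ["wood", "metal", "plastic"],
})

# Recover the category -> code mapping learned for each column.
mappings = {}
for col in df.columns:
    le = LabelEncoder().fit(df[col])
    mappings[col] = {str(cls): int(code) for code, cls in enumerate(le.classes_)}
    print(col, mappings[col])
```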


In [None]:
""" Q5. Calculate the covariance matrix for the following variables in a dataset: Age, Income, and Education
level. Interpret the results. """

In [17]:
import numpy as np
import pandas as pd
np.random.seed(501)
n = 500
age = np.random.randint(low = 21,high = 65,size = n)
income = np.random.randint(low = 1000000 , high = 10000000 ,size = n)
education = np.random.choice(["Bachelor's_Degree","Masters","PHD"],size = n)

In [29]:
dic = {
        'Age' : age,
        'Income' : income,
        'Education' : education
      }
df = pd.DataFrame(dic)
df.head()

,Age,Income,Education
0,44,5618622,Bachelor's_Degree
1,52,3495641,PHD
2,60,3054156,Masters
3,36,9512448,PHD
4,53,2218638,PHD


In [30]:
from sklearn.preprocessing import OrdinalEncoder
oe = OrdinalEncoder(categories = [["Bachelor's_Degree","Masters","PHD"]])
oe_encode = oe.fit_transform(df[["Education"]])
df["Education"] = np.ravel(oe_encode)

In [31]:
df.head()

,Age,Income,Education
0,44,5618622,0.0
1,52,3495641,2.0
2,60,3054156,1.0
3,36,9512448,2.0
4,53,2218638,2.0


In [32]:
df.cov()

,Age,Income,Education
Age,159.0352,-1224692.0,0.422846
Income,-1224692.0,6691210000000.0,37602.354709
Education,0.4228457,37602.35,0.657315
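Interpretation: the diagonal entries are the variances of each variable, and the off-diagonal entries are the pairwise covariances. The values look very large wherever Income is involved only because income is measured in the millions; covariance depends on the units of the variables. Dividing by the standard deviations removes that effect: for Age and Income, -1224692 / sqrt(159.0352 × 6691210000000) is roughly -0.04, which means essentially no linear relationship. That is expected here, because the three columns were generated independently at random. The sketch below rebuilds the same kind of data and converts the covariance matrix to a correlation matrix; the `.map` call is an assumed shortcut that mirrors the 0, 1, 2 codes of the OrdinalEncoder.

```python
import numpy as np
import pandas as pd

np.random.seed(501)
n = 500
df = pd.DataFrame({
    "Age": np.random.randint(low=21, high=65, size=n),
    "Income": np.random.randint(low=1000000, high=10000000, size=n),
    "Education": np.random.choice(["Bachelor's_Degree", "Masters", "PHD"], size=n),
})
# Same ordinal codes as the OrdinalEncoder cell (0, 1, 2).
df["Education"] = df["Education"].map(
    {"Bachelor's_Degree": 0.0, "Masters": 1.0, "PHD": 2.0}
)

cov = df.cov()
corr = df.corr()

# corr[i, j] = cov[i, j] / (std_i * std_j)
std = np.sqrt(np.diag(cov))
corr_from_cov = cov / np.outer(std, std)
print(corr.round(3))
```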


In [None]:
""" Q6. You are working on a machine learning project with a dataset containing several categorical
variables, including "Gender" (Male/Female), "Education Level" (High School/Bachelor's/Master's/PhD),
and "Employment Status" (Unemployed/Part-Time/Full-Time). Which encoding method would you use for
each variable, and why? """

Ans:
For "Gender", "Education Level", and "Employment Status", the appropriate encoding depends on whether the categories are ordered and on the algorithm being used. A reasonable choice for each variable is:

1. Gender: One-Hot Encoding is a good choice because the two values (Male and Female) have no order or hierarchy. One-Hot Encoding creates a binary column for each value, where 1 indicates the presence of that value and 0 its absence. With only two values the two columns are redundant (one is always 1 minus the other), so a single 0/1 column carries the same information.

2. Education Level: Ordinal Encoding is the better choice because the values have a natural order (High School < Bachelor's < Master's < PhD). Ordinal Encoding assigns codes that preserve this order, whereas Label Encoding assigns codes arbitrarily (alphabetically in scikit-learn), which would scramble it.

3. Employment Status: One-Hot Encoding could be used for the three values (Unemployed, Part-Time, Full-Time) if they are treated as having no natural order, for the same reason as for Gender. If the analysis instead treats them as increasing levels of employment (Unemployed < Part-Time < Full-Time), Ordinal Encoding is an alternative.

In [None]:
""" Q7. You are analyzing a dataset with two continuous variables, "Temperature" and "Humidity", and two
categorical variables, "Weather Condition" (Sunny/Cloudy/Rainy) and "Wind Direction" (North/South/
East/West). Calculate the covariance between each pair of variables and interpret the results. """

In [35]:
import numpy as np
import pandas as pd
np.random.seed(501)
n = 900
temperature = np.random.normal(loc = 40 , scale = 5, size = n)
humidity = np.random.normal(loc = 50 , scale = 5 , size = n)
weather = np.random.choice(["Sunny","Cloudy","Rainy"],size = n)
wind_direction = np.random.choice(["North","South","East","West"] ,size = n)

In [47]:
dic = {
    'temperature' : temperature,
    'humidity' : humidity,
    'weather' : weather,
    'wind_direction' : wind_direction
    }
df = pd.DataFrame(dic)
df.head()

,temperature,humidity,weather,wind_direction
0,40.322043,44.728258,Cloudy,West
1,32.69095,58.728638,Cloudy,East
2,43.661637,43.687795,Rainy,North
3,46.079148,62.036274,Cloudy,South
4,30.375834,50.602288,Sunny,East


In [48]:
df.cov(numeric_only = True)

,temperature,humidity
temperature,25.045162,0.940669
humidity,0.940669,24.15425


In [64]:
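from sklearn.preprocessing import OneHotEncoder

# One-hot encode both categorical columns; fit_transform returns a sparse matrix,
# so it is converted to a dense array before building the DataFrame.
ohe = OneHotEncoder()
encode = ohe.fit_transform(df[["weather","wind_direction"]])
oe_encode = pd.DataFrame(encode.toarray(),columns = ohe.get_feature_names_out())
oe_encode.head()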

,weather_Cloudy,weather_Rainy,weather_Sunny,wind_direction_East,wind_direction_North,wind_direction_South,wind_direction_West
0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
1,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,0.0,1.0,0.0,0.0,1.0,0.0,0.0
3,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,0.0,0.0,1.0,1.0,0.0,0.0,0.0


In [65]:
oe_encode.cov()

,weather_Cloudy,weather_Rainy,weather_Sunny,wind_direction_East,wind_direction_North,wind_direction_South,wind_direction_West
weather_Cloudy,0.221346,-0.111591,-0.109755,0.005973,-0.005551,-0.003281,0.002859
weather_Rainy,-0.111591,0.223933,-0.112342,-0.003639,0.002704,-0.000692,0.001626
weather_Sunny,-0.109755,-0.112342,0.222097,-0.002335,0.002846,0.003974,-0.004485
wind_direction_East,0.005973,-0.003639,-0.002335,0.17968,-0.052939,-0.056068,-0.070672
wind_direction_North,-0.005551,0.002704,0.002846,-0.052939,0.174875,-0.053943,-0.067993
wind_direction_South,-0.003281,-0.000692,0.003974,-0.056068,-0.053943,0.182023,-0.072012
wind_direction_West,0.002859,0.001626,-0.004485,-0.070672,-0.067993,-0.072012,0.210677
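Interpretation: the covariance between temperature and humidity (0.94) is small next to their variances (about 25 and 24), which corresponds to a correlation of roughly 0.04, so there is essentially no linear relationship. For the one-hot columns, each diagonal entry is the variance of a 0/1 indicator. Covariances between categories of the same variable are negative because a row can belong to only one of them, and covariances between weather and wind direction are close to zero. All of this is expected, since the four columns were generated independently at random.

The two matrices above do not cover the pairs that mix a continuous and a categorical variable. One way to obtain every pair in a single matrix is sketched below, using `pd.get_dummies` instead of OneHotEncoder for brevity (a substitution made for this example); `full` and `full_cov` are names introduced here.

```python
import numpy as np
import pandas as pd

np.random.seed(501)
n = 900
df = pd.DataFrame({
    "temperature": np.random.normal(loc=40, scale=5, size=n),
    "humidity": np.random.normal(loc=50, scale=5, size=n),
    "weather": np.random.choice(["Sunny", "Cloudy", "Rainy"], size=n),
    "wind_direction": np.random.choice(["North", "South", "East", "West"], size=n),
})

# One-hot encode the categorical columns; numeric columns pass through unchanged.
full = pd.get_dummies(df, columns=["weather", "wind_direction"], dtype=float)

# Covariance between every pair of the 2 + 3 + 4 = 9 resulting columns.
full_cov = full.cov()
print(full_cov.loc[["temperature", "humidity"]].round(3))
```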
