# Relationship between categorical variables

In this lecture, we will **analyze relationships between categorical variables** using two tools: the **chi-square (χ²) test** and **Cramer's V**.

- **Chi-square Test**:

    - The chi-square test is a statistical test used to **determine whether there is a significant association between two categorical variables**. 
    - Interpretation: a p-value below the chosen significance level (commonly 0.05) suggests a statistically significant association between the categorical variables. A p-value at or above that level means there is not enough evidence to conclude that the variables are associated.

- **Cramer's V**:

    - Cramer's V is a measure of association used to **quantify the strength of the relationship between categorical variables**. 
    - Interpretation: Cramer's V ranges from 0 to 1, where **0 indicates no association, and 1 represents a perfect association between the variables**. A higher value of Cramer's V indicates a stronger relationship between the categorical variables.
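
To make the two endpoints of Cramer's V concrete, here is a minimal sketch on two made-up 2×2 tables. The counts are invented for illustration and are not taken from the car dataset. It uses `scipy.stats.chi2_contingency` for the test and `scipy.stats.contingency.association` for Cramer's V.

```python
import numpy as np
from scipy.stats import chi2_contingency
from scipy.stats.contingency import association

# Hypothetical counts: rows and columns are two categorical variables.
independent = np.array([[10, 10],
                        [10, 10]])   # the row category tells us nothing about the column
perfect = np.array([[10, 0],
                    [0, 10]])        # the row category fully determines the column


def summarize(table):
    """Return (p_value, cramers_v) for a contingency table of counts."""
    _, p_value, _, _ = chi2_contingency(table)
    v = association(table, method="cramer")
    return p_value, v


p_ind, v_ind = summarize(independent)
p_perf, v_perf = summarize(perfect)

print(f"independent table: p = {p_ind:.3f}, V = {v_ind:.2f}")
print(f"perfect table:     p = {p_perf:.6f}, V = {v_perf:.2f}")
```

The first table should give a large p-value with V at 0, and the second a small p-value with V at 1.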


In our case, we will apply these tests to examine the **relationship between car manufacturers and car prices**. By analyzing this relationship, we can gain insights into how different manufacturers' cars are priced, identify any significant associations, and understand the impact of the manufacturer on the pricing of cars.

*For the math lovers:*
- *Chi-square test: compares the observed frequencies in a contingency table with the expected frequencies under the assumption of independence. By assessing the deviation from expected frequencies, the chi-square test helps us determine if the relationship between variables is statistically significant.* 
- *To interpret the chi-square test results, we consider the p-value and compare it to the significance level (α) chosen for the test. If the p-value is less than α (e.g., p-value < 0.05), we reject the null hypothesis of independence. This suggests that there is a significant association between the categorical variables. If the p-value is greater than or equal to α, we fail to reject the null hypothesis. This indicates that there is not enough evidence to conclude a significant association between the variables.*
- *Cramer's V is derived from the chi-square statistic and takes into account the size of the contingency table.*
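
For reference, the standard textbook definitions behind these bullets can be written as follows. The symbols are notation introduced here, not taken from the lecture: $O_{ij}$ and $E_{ij}$ are the observed and expected counts in cell $(i, j)$, $n$ is the total number of observations, and $r$ and $c$ are the numbers of rows and columns in the contingency table.

```latex
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{n},
\qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
V = \sqrt{\frac{\chi^2}{n \cdot \min(r - 1,\; c - 1)}}
```

The p-value comes from comparing $\chi^2$ with a chi-square distribution with $(r - 1)(c - 1)$ degrees of freedom.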

## Dataset

The dataset we will look at today is the "Car Price Prediction Dataset", which is used for predicting car prices from a mix of numerical and categorical attributes.

Here are the columns in the dataset, as they appear in the preview below:

1. ID: An identifier for each record.
2. Price: The price of the car (the variable usually predicted).
3. Levy: The levy on the car; missing values appear as `-`.
4. Manufacturer: The brand or manufacturer of the car.
5. Model: The specific model name of the car.
6. Prod. year: The production year of the car.
7. Category: The body type of the car (e.g., Jeep, Hatchback).
8. Leather interior: Whether the car has a leather interior (Yes, No).
9. Fuel type: The type of fuel used by the car (e.g., Petrol, Hybrid).
10. Engine volume: The engine displacement in liters.
11. Mileage: The total distance driven, stored as text with a `km` suffix.
12. Cylinders: The number of engine cylinders.
13. Gear box type: The type of transmission (e.g., Automatic, Tiptronic, Variator).
14. Drive wheels: The driven wheels (e.g., 4x4, Front).
15. Doors: The door-count category (the value `04-May` in the preview looks like a spreadsheet date conversion of `4-5`).
16. Wheel: The steering side (Left wheel, Right-hand drive).
17. Color: The color of the car.
18. Airbags: The number of airbags.

*The dataset is often used for regression analysis and machine learning tasks to develop predictive models for car price estimation. By analyzing the relationships between the car's features and its price, valuable insights can be gained for pricing strategies, market analysis, and decision-making in the automotive industry.*

We won't get into predictive analytics yet, but **let's use EDA to see if the price of a car and the manufacturer (brand) are related.**

In [2]:
import pandas as pd

df = pd.read_csv("car_price_prediction.csv")

In [3]:
df.head()

|  | ID | Price | Levy | Manufacturer | Model | Prod. year | Category | Leather interior | Fuel type | Engine volume | Mileage | Cylinders | Gear box type | Drive wheels | Doors | Wheel | Color | Airbags |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 45654403 | 13328 | 1399 | LEXUS | RX 450 | 2010 | Jeep | Yes | Hybrid | 3.5 | 186005 km | 6.0 | Automatic | 4x4 | 04-May | Left wheel | Silver | 12 |
| 1 | 44731507 | 16621 | 1018 | CHEVROLET | Equinox | 2011 | Jeep | No | Petrol | 3.0 | 192000 km | 6.0 | Tiptronic | 4x4 | 04-May | Left wheel | Black | 8 |
| 2 | 45774419 | 8467 | - | HONDA | FIT | 2006 | Hatchback | No | Petrol | 1.3 | 200000 km | 4.0 | Variator | Front | 04-May | Right-hand drive | Black | 2 |
| 3 | 45769185 | 3607 | 862 | FORD | Escape | 2011 | Jeep | Yes | Hybrid | 2.5 | 168966 km | 4.0 | Automatic | 4x4 | 04-May | Left wheel | White | 0 |
| 4 | 45809263 | 11726 | 446 | HONDA | FIT | 2014 | Hatchback | Yes | Petrol | 1.3 | 91901 km | 4.0 | Automatic | Front | 04-May | Left wheel | Silver | 4 |


# Chi-square and Cramer's V - categorical variables correlation

We'll use the **chi-square test to assess the independence of Price and Manufacturer**, and **Cramer's V to quantify the strength of the association between these two features**.

Both features need to be categorical for this analysis. Manufacturer already is, so we'll start by binning the continuous Price variable into five quantile-based ranges, each containing roughly the same number of cars.

In [None]:
# Bin Price into 5 quantile-based ranges (roughly equal number of cars per range)
df['price_range'] = pd.qcut(df['Price'], q=5)
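
As a quick illustration of what `pd.qcut` does, the sketch below bins a made-up series of 100 prices (the integers 1 to 100, purely hypothetical data) into five quantile-based ranges. Because the cut points are quantiles, each range receives the same number of observations here.

```python
import pandas as pd

# Hypothetical prices: the integers 1..100, not the real car prices.
toy_prices = pd.Series(range(1, 101), name="Price")

# q=5 splits at the 20th, 40th, 60th and 80th percentiles.
toy_ranges = pd.qcut(toy_prices, q=5)

# Count how many observations fall into each range, in bin order.
bin_counts = toy_ranges.value_counts().sort_index()
print(bin_counts)
```

On skewed data such as real prices, the ranges keep similar counts but can differ a lot in width, which is the main difference from equal-width binning with `pd.cut`.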

To run the chi-square test and compute Cramer's V in Python, we can use the SciPy library.


In [9]:
from scipy.stats import chi2_contingency

# Create a contingency table of manufacturer vs. price range
contingency_table = pd.crosstab(df['Manufacturer'], df['price_range'])

# Perform the chi-square test
stat, p, dof, expected = chi2_contingency(contingency_table)

# Print the results
print('Chi-Square Test Results:')
print('Test Statistic:', stat)
print('p-value:', p)
print('Degrees of Freedom:', dof)

Chi-Square Test Results:
Test Statistic: 5010.359543865411
p-value: 0.0
Degrees of Freedom: 256


We'll focus on the p-value for now.

As noted earlier, *a p-value below 0.05 suggests that there is a significant association between the categorical variables.*

Here the p-value prints as 0.0, which means it is too small to be represented in floating point rather than exactly zero. We therefore reject the null hypothesis of independence: **there is a statistically significant relationship between Manufacturer and price range**.
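
One thing worth checking before leaning on this p-value: the chi-square approximation is usually considered reliable only when expected counts are not too small (a common rule of thumb is at least 5 in most cells), and a table with many manufacturers may contain rare brands. The sketch below shows one way to inspect this using the `expected` array that `chi2_contingency` returns. The table and the helper name `share_of_small_cells` are hypothetical, chosen for illustration.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical contingency table: 3 brands x 2 price ranges.
toy_table = pd.DataFrame(
    {"low": [40, 30, 1], "high": [35, 45, 2]},
    index=["BRAND_A", "BRAND_B", "RARE_BRAND"],
)


def share_of_small_cells(table, threshold=5):
    """Fraction of cells whose expected count under independence is below threshold."""
    _, _, _, expected = chi2_contingency(table)
    return float(np.mean(expected < threshold))


small_share = share_of_small_cells(toy_table)
print(f"Share of cells with expected count < 5: {small_share:.0%}")
```

The same helper could be called on `contingency_table`. If many cells fall below the threshold, grouping rare manufacturers into an "Other" category is one common remedy.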

Now, let's look at the **strength of that relationship**.

In [10]:
from scipy.stats.contingency import association

association(contingency_table, method="cramer")

0.2551736218170179

As mentioned above, *Cramer's V ranges between 0 and 1, where 0 indicates no association and 1 represents a perfect association between the variables.*

The calculated value of about 0.26 indicates, by common rules of thumb, a **moderate association between Manufacturer and price range**. The two variables are related, but the manufacturer alone is far from determining which price range a car falls into.
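
Since Cramer's V is derived from the chi-square statistic, it can also be computed by hand as a cross-check. The sketch below does this on a small made-up table (hypothetical counts) using the standard formula V = sqrt(χ² / (n · min(r − 1, c − 1))) and compares the result with `association`.

```python
import numpy as np
from scipy.stats import chi2_contingency
from scipy.stats.contingency import association

# Hypothetical 3x3 table of counts.
toy_table = np.array([
    [30, 10, 5],
    [10, 25, 10],
    [5, 10, 35],
])

# correction=False: use the plain (uncorrected) chi-square statistic.
chi2_stat, _, _, _ = chi2_contingency(toy_table, correction=False)

n = toy_table.sum()                   # total number of observations
min_dim = min(toy_table.shape) - 1    # min(rows - 1, columns - 1)

manual_v = np.sqrt(chi2_stat / (n * min_dim))
scipy_v = association(toy_table, method="cramer")

print(f"manual V = {manual_v:.4f}, scipy V = {scipy_v:.4f}")
```

Dividing by `n` and by the smaller table dimension is what makes V comparable across tables of different sizes, unlike the raw chi-square statistic.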

# Normality tests: Kolmogorov-Smirnov

In data analysis, it is often important to assess the distributional properties of a dataset since **many statistical techniques assume that the data follow a normal distribution**. 

Normality tests are statistical tests that help us **determine if a given dataset follows a normal distribution** or if it significantly deviates from it. One commonly used normality test is the Kolmogorov-Smirnov test.

Applications:

1. **Assumption Testing**: Normality tests are employed to **assess the assumption of normality in various statistical techniques, such as t-tests, analysis of variance (ANOVA), linear regression, and others**. Violations of normality assumptions may require alternative approaches or data transformations.

2. **Data Exploration**: Normality tests help analysts understand the distributional properties of the data they are working with. This information can **guide the selection of appropriate statistical methods and provide insights into the nature of the variables**.


By running the Kolmogorov-Smirnov test on the `Price` variable, we can check whether it follows a normal distribution.

In [11]:
from scipy.stats import kstest, norm

In [20]:
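# extract the car price data
car_price = df['Price']

In [21]:
# Perform the Kolmogorov-Smirnov test for normality.
# Caution: 'norm' with no extra arguments is the standard normal N(0, 1),
# so raw prices are compared against mean 0 and standard deviation 1.
kstest_result = kstest(car_price, 'norm')
kstest_result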

KstestResult(statistic=0.9991162853366435, pvalue=0.0)

In [22]:
# print the test result
if kstest_result.pvalue < 0.05:
    print('The test results indicate that the distribution of car prices is significantly different from a normal distribution.')
else:
    print('The test results indicate that the distribution of car prices is not significantly different from a normal distribution.')

The test results indicate that the distribution of car prices is significantly different from a normal distribution.
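
A caveat on the test above: calling `kstest(data, 'norm')` without further arguments compares the data with the *standard* normal distribution (mean 0, standard deviation 1). Any variable measured on a different scale, such as prices in the thousands, will then produce a statistic close to 1 regardless of its shape. The sketch below illustrates the difference on synthetic data generated here for illustration (not the car prices). It passes the sample mean and standard deviation through `args`; because these are estimated from the same data, the resulting p-value is only approximate.

```python
import numpy as np
from scipy.stats import kstest

# Synthetic, genuinely normal data with mean 50 and standard deviation 10.
rng = np.random.default_rng(0)
sample = rng.normal(loc=50, scale=10, size=500)

# Compared with N(0, 1): rejected purely because of the scale mismatch.
raw_result = kstest(sample, "norm")

# Compared with a normal whose mean and standard deviation match the sample.
fitted_result = kstest(sample, "norm", args=(sample.mean(), sample.std(ddof=1)))

print("vs N(0, 1):      ", raw_result.statistic, raw_result.pvalue)
print("vs fitted normal:", fitted_result.statistic, fitted_result.pvalue)
```

The same pattern applies to `df['Price']` by passing its mean and standard deviation through `args`.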
