# Correlation and Relationship Analysis
by Smahi

## Scope
Correlations and Relationships:
- a. Is there a correlation between age and monthly income?
- b. Do marital status and family size have any relationship?
- c. Is there a relationship between educational qualifications and occupation?
- d. Are there correlations between the Output (online food orders) and any other variables?

## Summary
- There is a moderate positive correlation (about 0.46) between `Age` and the numeric coding of `Monthly_income`.


- Marital status and its relationship with family size:
    - Married:
        - Nearly all married individuals (103 of 108) have family sizes ranging from 2 to 6.
        - The most common family size for married individuals is 3 (27), followed by 5 (23) and 2 (20).
    - Prefer not to say:
        - Individuals who prefer not to disclose their marital status tend to have smaller family sizes (7 of 12 have a family size of 1 or 2), but the group is too small to generalize from.
        - The most common family size for this group is 2.
    - Single:
        - Singles have a diverse range of family sizes; the most common is 3 (88), followed by 2 (77) and 4 (46).
        - Singles outnumber married individuals at every family size except 6, because there are more singles overall (268 vs. 108). Proportionally, larger families (size 4, 5, 6) are more common in the "Married" group (56 of 108, about 52%) than among singles (87 of 268, about 32%).

- Occupation and its relationship with education:

    **Occupation**

    - Employee:
        - Most employees have a graduate-level education (68 of 118), with a substantial number having post-graduate qualifications as well (38).
        - No employees have only school-level education or are categorized as uneducated.
    - Housewife:
        - Housewives in this dataset have varying educational backgrounds.
        - Most housewives (5 of 9) have school-level education; 3 are graduates, 1 is categorized as uneducated, and none are post-graduates.
    - Self-Employed:
        - Self-employed individuals have diverse educational backgrounds.
        - Most self-employed individuals (43 of 54) have graduate or post-graduate qualifications, but there are also some with school-level education (7).
    - Student:
        - All 207 students have at least a graduate-level education.
        - Most students have a post-graduate (122) or graduate (77) education; 8 hold a Ph.D.

    **Education**

    - Graduate:
        - Graduates are spread across all occupations, with a notable presence in the employee and student categories.
    - Ph.D.:
        - Individuals with a Ph.D. are mainly employees (12) or students (8), with a small number self-employed (3).
    - Post Graduate:
        - Post-graduates appear in every occupation except housewife, and most of them (122 of 174) are students.
    - School:
        - Those with a school-level education are only found in the housewife and self-employed categories.
    - Uneducated:
        - The two individuals categorized as "Uneducated" are one housewife and one self-employed respondent.

- The chi-square tests indicate that "Feedback," "Marital_status," and "Occupation" are significantly associated with online food orders ("Output"), each with p < 0.001.
- The "Education" variable also shows evidence of association (p ≈ 0.038), but it is much weaker.
- The "Gender" variable (p ≈ 0.58) does not appear to be significantly related to online food orders in this analysis.

## Imports

In [14]:
import pandas as pd
from scipy.stats import chi2_contingency
import altair as alt
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Load the dataset
df = pd.read_csv('C:/Users/SMAHI/Desktop/Online-food-delivery/Data/clean_data.csv')

In [3]:
# Preview
df.head()

| Unnamed: 0 | Age | Gender | Marital_status | Occupation | Monthly_income | Education | Family_size | latitude | longitude | Pin_code | Output | Feedback |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 20 | Female | Single | Student | No Income | Post Graduate | 4 | 12.9766 | 77.5993 | 560001 | Yes | Positive |
| 1 | 24 | Female | Single | Student | Below Rs.10000 | Graduate | 3 | 12.977 | 77.5773 | 560009 | Yes | Positive |
| 2 | 22 | Male | Single | Student | Below Rs.10000 | Post Graduate | 3 | 12.9551 | 77.6593 | 560017 | Yes | Negative |
| 3 | 22 | Female | Single | Student | No Income | Graduate | 6 | 12.9473 | 77.5616 | 560019 | Yes | Positive |
| 4 | 22 | Male | Single | Student | Below Rs.10000 | Post Graduate | 4 | 12.985 | 77.5533 | 560010 | Yes | Positive |


In [4]:
# Shape
df.shape

(388, 12)

### a. Is there a correlation between age and monthly income?

In [5]:
# Different types of monthly_income
df.Monthly_income.unique()

array(['No Income', 'Below Rs.10000', 'More than 50000', '10001 to 25000',
       '25001 to 50000'], dtype=object)

In [6]:
# Convert Monthly_income to numerical values for correlation analysis
income_mapping = {'No Income': 0, 'Below Rs.10000': 1, 'More than 50000':2, '10001 to 25000':3, '25001 to 50000':4}
df['Monthly_income_numeric'] = df['Monthly_income'].map(income_mapping)


In [18]:
age_income_corr = df['Age'].corr(df['Monthly_income_numeric'])
print(f"Correlation between age and monthly income: {age_income_corr}")

Correlation between age and monthly income: 0.4559379304686674


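The correlation coefficient between age and the numeric income code is about 0.46, which suggests a moderate positive relationship in this dataset. Note that `income_mapping` assigns 2 to 'More than 50000', placing it between 'Below Rs.10000' and '10001 to 25000', so the codes do not follow income order and the coefficient should be interpreted with caution.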
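Since the income bands are ordered categories, a rank-based check is one option. The sketch below codes the five band labels from lowest to highest and computes a Spearman correlation as the Pearson correlation of ranks. The names `ordinal_income_mapping` and `spearman_age_income`, and the small `demo` frame, are hypothetical and used only for illustration; the rows are made up and are not results from the dataset.

```python
import pandas as pd

# Income bands coded in increasing order (assumed ordering of the labels above).
ordinal_income_mapping = {
    'No Income': 0,
    'Below Rs.10000': 1,
    '10001 to 25000': 2,
    '25001 to 50000': 3,
    'More than 50000': 4,
}

def spearman_age_income(frame):
    """Spearman correlation between Age and the ordinal income code.

    Computed as the Pearson correlation of the two rank vectors
    (average ranks are used for ties).
    """
    income_code = frame['Monthly_income'].map(ordinal_income_mapping)
    return frame['Age'].rank().corr(income_code.rank())

# Made-up rows for illustration only; not taken from clean_data.csv.
demo = pd.DataFrame({
    'Age': [20, 22, 25, 28, 31],
    'Monthly_income': ['No Income', 'Below Rs.10000', '10001 to 25000',
                       '25001 to 50000', 'More than 50000'],
})
demo_rho = spearman_age_income(demo)
print(demo_rho)
```

On the real data, the same function could be called as `spearman_age_income(df)`; the resulting value may differ from the coefficient reported above.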

### b. Do marital status and family size have any relationship?

In [25]:
marital_family_relation = pd.crosstab(df['Marital_status'], df['Family_size'])
print(f"Relationship between marital status and family size:\n{marital_family_relation}")

Relationship between marital status and family size:
Family_size         1   2   3   4   5   6
Marital_status                           
Married             5  20  27  16  23  17
Prefer not to say   3   4   2   1   2   0
Single             16  77  88  46  29  12


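Married:
- Nearly all married individuals (103 of 108) have family sizes ranging from 2 to 6.
- The most common family size for married individuals is 3 (27), followed by 5 (23) and 2 (20).

Prefer not to say:
- Individuals who prefer not to disclose their marital status tend to have smaller family sizes (7 of 12 have a family size of 1 or 2), but the group is too small to generalize from.
- The most common family size for this group is 2.

Single:
- Singles have a diverse range of family sizes; the most common is 3 (88), followed by 2 (77) and 4 (46).
- Singles outnumber married individuals at every family size except 6, because there are more singles overall (268 vs. 108). Proportionally, larger families (size 4, 5, 6) are more common in the "Married" group (56 of 108, about 52%) than among singles (87 of 268, about 32%).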
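Because the three marital-status groups differ a lot in size, row proportions are easier to compare than raw counts. The following is a minimal sketch that rebuilds the crosstab from the counts printed above and normalizes each row; the variable names `marital_family_counts`, `row_shares`, and `large_family_share` are illustrative. With the original `df`, `pd.crosstab(df['Marital_status'], df['Family_size'], normalize='index')` should give the same proportions.

```python
import pandas as pd

# Counts copied from the crosstab output above.
marital_family_counts = pd.DataFrame(
    [[5, 20, 27, 16, 23, 17],
     [3, 4, 2, 1, 2, 0],
     [16, 77, 88, 46, 29, 12]],
    index=['Married', 'Prefer not to say', 'Single'],
    columns=[1, 2, 3, 4, 5, 6],
)

# Divide each row by its total so groups of different size are comparable.
row_shares = marital_family_counts.div(marital_family_counts.sum(axis=1), axis=0)

# Share of each group living in families of size 4, 5, or 6.
large_family_share = row_shares[[4, 5, 6]].sum(axis=1)
print(large_family_share.round(3))
```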

### c. Is there a relationship between educational qualifications and occupation?

In [16]:
edu_occ_relation = pd.crosstab(df['Education'], df['Occupation'])
print(f"Relationship between educational qualifications and occupation:\n{edu_occ_relation}")

Relationship between educational qualifications and occupation:
Occupation     Employee  House wife  Self Employeed  Student
Education                                                   
Graduate             68           3              29       77
Ph.D                 12           0               3        8
Post Graduate        38           0              14      122
School                0           5               7        0
Uneducated            0           1               1        0


**Occupation**:

Employee:
- Most employees have a graduate-level education, with a substantial number having post-graduate qualifications as well.
- No employees in this dataset have only school-level education or are categorized as uneducated.

Housewife:
- Housewives in this dataset have varying educational backgrounds.
- Most housewives (5 of 9) have school-level education; 3 are graduates, 1 is categorized as uneducated, and none are post-graduates.

Self-Employed:
- Self-employed individuals have diverse educational backgrounds.
- A significant number of self-employed individuals have graduate or post-graduate qualifications, but there are also some with school-level education.

Student:
- All 207 students have at least a graduate-level education.
- Most students have a post-graduate (122) or graduate (77) education; 8 hold a Ph.D.


**Education**:

Graduate:
- Graduates are spread across various occupations, with a notable presence in employee and student categories.

Ph.D.:
- Individuals with a Ph.D. are mainly employees (12) or students (8), with a small number self-employed (3).

Post Graduate:
- Post-graduates appear in every occupation except housewife, and most of them (122 of 174) are students.

School:
- Those with a school-level education are only found in the housewife and self-employed categories.

Uneducated:
- The two individuals categorized as "Uneducated" are one housewife and one self-employed respondent.
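The observations above are descriptive. One way to test the relationship formally is a chi-square test of independence on the same table, sketched below using the counts printed above. The variable names are illustrative, and the "expected count below 5" check is a common rule of thumb rather than something used in the original analysis; with rows as small as "School" and "Uneducated", the test result should be read cautiously.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Counts copied from the crosstab output above (labels as they appear in the data).
edu_occ_counts = pd.DataFrame(
    [[68, 3, 29, 77],
     [12, 0, 3, 8],
     [38, 0, 14, 122],
     [0, 5, 7, 0],
     [0, 1, 1, 0]],
    index=['Graduate', 'Ph.D', 'Post Graduate', 'School', 'Uneducated'],
    columns=['Employee', 'House wife', 'Self Employeed', 'Student'],
)

# Chi-square test of independence between education and occupation.
chi2, p_value, dof, expected = chi2_contingency(edu_occ_counts)

# Rule of thumb: the chi-square approximation is less reliable when many
# expected counts fall below 5.
small_cells = int((expected < 5).sum())
print(f"chi2={chi2:.2f}, p={p_value:.3g}, dof={dof}")
print(f"cells with expected count < 5: {small_cells} of {expected.size}")
```

If many cells are sparse, merging small categories (for example "School" and "Uneducated") before testing is one possible design choice.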

### d. Are there correlations between the Output (online food orders) and any other variables?

In [15]:
categorical_variables = ['Gender', 'Marital_status', 'Occupation', 'Education', 'Feedback']

output_corr = {}
for variable in categorical_variables:
    contingency_table = pd.crosstab(df['Output'], df[variable])
    chi2, p, _, _ = chi2_contingency(contingency_table)
    output_corr[variable] = p

output_corr = pd.Series(output_corr).sort_values()
print(f"Chi-square p-values for independence between Output and other categorical variables:\n{output_corr}")

Chi-square p-values for independence between Output and other categorical variables:
Feedback          1.100291e-30
Marital_status    1.358086e-07
Occupation        2.378679e-07
Education         3.757604e-02
Gender            5.751216e-01
dtype: float64


**Null hypothesis: each categorical variable is independent of the Output variable.**
- The chi-square tests reject independence for "Feedback" (p ≈ 1.1e-30), "Marital_status" (p ≈ 1.4e-07), and "Occupation" (p ≈ 2.4e-07), so these variables are significantly associated with online food orders ("Output").
- The "Education" variable also shows evidence of association (p ≈ 0.038), significant at the 5% level but much weaker.
- The "Gender" variable (p ≈ 0.58) does not appear to be significantly related to online food orders in this analysis.
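A p-value indicates whether an association is detectable, not how strong it is. As a complement, an effect size such as Cramér's V could be computed for each variable. The sketch below is an assumption-laden illustration: `cramers_v` is a hypothetical helper, not part of the original notebook, and the `demo` frame contains made-up rows chosen so that one variable tracks `Output` perfectly and the other not at all.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(frame, row_var, col_var):
    """Cramér's V for two categorical columns (0 = no association, 1 = perfect)."""
    table = pd.crosstab(frame[row_var], frame[col_var])
    # correction=False: use the uncorrected chi-square statistic, also for 2x2 tables.
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1  # smaller table dimension minus one
    return float(np.sqrt(chi2 / (n * k)))

# Made-up rows for illustration only; not taken from clean_data.csv.
demo = pd.DataFrame({
    'Output':   ['Yes', 'Yes', 'No', 'No'],
    'Feedback': ['Positive', 'Positive', 'Negative', 'Negative'],
    'Gender':   ['Male', 'Female', 'Male', 'Female'],
})
v_feedback = cramers_v(demo, 'Output', 'Feedback')
v_gender = cramers_v(demo, 'Output', 'Gender')
print(v_feedback, v_gender)
```

On the real data, the helper could be applied in the same loop as the p-values, for example `cramers_v(df, 'Output', variable)` for each entry of `categorical_variables`.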