In [None]:
Q1. Pearson correlation coefficient is a measure of the linear relationship between two variables. Suppose
you have collected data on the amount of time students spend studying for an exam and their final exam
scores. Calculate the Pearson correlation coefficient between these two variables and interpret the result.

Q2. Spearman's rank correlation is a measure of the monotonic relationship between two variables.
Suppose you have collected data on the amount of sleep individuals get each night and their overall job
satisfaction level on a scale of 1 to 10. Calculate the Spearman's rank correlation between these two
variables and interpret the result.

Q3. Suppose you are conducting a study to examine the relationship between the number of hours of
exercise per week and body mass index (BMI) in a sample of adults. You collected data on both variables
for 50 participants. Calculate the Pearson correlation coefficient and the Spearman's rank correlation
between these two variables and compare the results.

Q4. A researcher is interested in examining the relationship between the number of hours individuals
spend watching television per day and their level of physical activity. The researcher collected data on
both variables from a sample of 50 participants. Calculate the Pearson correlation coefficient between
these two variables.

Q5. A survey was conducted to examine the relationship between age and preference for a particular
brand of soft drink. The survey results are shown below:

Age(Years)   Soft drink Preference
    25            Coke
    42            Pepsi
    37         Mountain dew
    19            Coke
    31            Pepsi
    28            Coke

Q6. A company is interested in examining the relationship between the number of sales calls made per day
and the number of sales made per week. The company collected data on both variables from a sample of
30 sales representatives. Calculate the Pearson correlation coefficient between these two variables.


In [6]:
# Sol 1:
'''
Let's assume we have a dataset containing the amount of time (in hours) that students spend studying and their corresponding final exam scores (out of 100). 
To calculate the Pearson correlation coefficient between these two variables, we can use the pearsonr() function from the scipy.stats module in Python.
'''
from scipy.stats import pearsonr

# Example data
study_time = [5, 10, 15, 20, 25, 30]
exam_scores = [65, 70, 75, 80, 85, 90]

# Calculate Pearson correlation coefficient
corr_coef, p_value = pearsonr(study_time, exam_scores)

print("Pearson correlation coefficient:", corr_coef)

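'''
Interpretation:
The Pearson correlation coefficient ranges from -1 to 1,

where,
-1 indicates a perfect negative linear relationship,
0 indicates no linear relationship,
1 indicates a perfect positive linear relationship.

For this example data the coefficient is 1 (up to floating-point rounding):
every additional 5 hours of study goes with exactly 5 more exam points,
so the relationship is perfectly linear and positive.
'''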

Pearson correlation coefficient: 0.9999999999999999
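
To see what pearsonr() computes, the coefficient can also be worked out from its definition: the sum of products of deviations from the mean, divided by the square root of the product of the two sums of squared deviations. The sketch below reuses the example data from Sol 1; the names dx, dy and r_manual are introduced here for illustration only.

```python
import numpy as np
from scipy.stats import pearsonr

# Example data from Sol 1
study_time = np.array([5, 10, 15, 20, 25, 30], dtype=float)
exam_scores = np.array([65, 70, 75, 80, 85, 90], dtype=float)

# Deviations of each observation from its variable's mean
dx = study_time - study_time.mean()
dy = exam_scores - exam_scores.mean()

# Pearson r from the definition
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Cross-check against SciPy
r_scipy, _ = pearsonr(study_time, exam_scores)

print("Manual r:", r_manual)
print("SciPy r: ", r_scipy)
```

Both values should agree up to floating-point rounding.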


In [7]:
# Sol 2:
'''
Spearman's rank correlation measures the strength and direction of a monotonic relationship between two variables: the relationship need not be linear, only consistently increasing or consistently decreasing.

To compute it for the amount of sleep individuals get each night and their overall job satisfaction level,

each variable's values are ranked from lowest to highest, starting at 1 for the lowest value.
If there are ties, the tied values receive the average of the ranks they span.
The coefficient is then the Pearson correlation of the two sets of ranks; spearmanr() does the ranking internally.
'''

import scipy.stats as stats

# amount of sleep individuals get
sleep = [6, 7, 8, 7, 6, 8, 7, 5, 6, 8]

# job satisfaction level on a scale of 1 to 10
job_satisfaction = [4, 7, 6, 8, 3, 9, 5, 2, 1, 7]

# calculate Spearman's rank correlation
rho, pval = stats.spearmanr(sleep, job_satisfaction)

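print(f"Spearman's rank correlation coefficient: {rho:.2f}")
print(f"P-value: {pval:.2f}")

# Interpretation: rho = 0.81 indicates a strong positive monotonic association
# in this example data: people who sleep more tend to report higher job satisfaction.
# The p-value prints as 0.00 because it is below 0.005 when rounded to two decimals.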


Spearman's rank correlation coefficient: 0.81
P-value: 0.00
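
The ranking procedure described in Sol 2 can be sketched explicitly: rank each variable with average ranks for ties, then take the Pearson correlation of the ranks. This is an illustrative cross-check on the same example data, not a replacement for spearmanr(); the names sleep_ranks, job_ranks and rho_manual are introduced here.

```python
from scipy.stats import pearsonr, rankdata, spearmanr

# Example data from Sol 2
sleep = [6, 7, 8, 7, 6, 8, 7, 5, 6, 8]
job_satisfaction = [4, 7, 6, 8, 3, 9, 5, 2, 1, 7]

# rankdata() assigns rank 1 to the smallest value; by default,
# tied values receive the average of the ranks they span
sleep_ranks = rankdata(sleep)
job_ranks = rankdata(job_satisfaction)

# Spearman's rho is the Pearson correlation of the ranks
rho_manual, _ = pearsonr(sleep_ranks, job_ranks)
rho_scipy, _ = spearmanr(sleep, job_satisfaction)

print("Sleep ranks:", sleep_ranks)
print(f"Manual rho: {rho_manual:.2f}")
print(f"SciPy rho:  {rho_scipy:.2f}")
```

The three people who sleep 6 hours share ranks 2, 3 and 4, so each gets rank 3.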


In [16]:
# Sol 3:
'''
Assuming we have the data for both variables for 50 participants, we can calculate both the Pearson correlation coefficient and the Spearman's rank correlation coefficient using Python.
'''
import csv

# BMI data (50 values, one per participant)
BMI = [22.5, 26.7, 31.2, 24.3, 29.1, 27.8, 23.6, 30.2, 25.5, 26.9, 28.1, 23.9, 31.5, 
       30.8, 26.3, 28.7, 24.6, 29.4, 27.5, 30.1, 25.8, 23.1, 28.4, 31.9, 27.6, 24.5, 
       29.7, 26.2, 28.3, 24.1, 30.6, 25.9, 27.4, 29.2, 24.8, 31.1, 26.4, 28.9, 30.4, 
       25.3, 29.6, 27.3, 23.7, 30.3, 25.1, 28.2, 24.2, 29.3, 26.5, 27.2]

# Hours of exercise data
hours_exercise = [3, 5, 2, 7, 4, 1, 6, 2, 5, 4, 3, 7, 1, 2, 6, 4, 5, 3, 4, 2, 5, 6, 7, 2, 1, 
                  4, 5, 2, 6, 3, 1, 7, 5, 4, 2, 3, 6, 1, 5, 4, 2, 3, 6, 7, 1, 2, 5, 4, 3, 6]

# Create a list of tuples
data = list(zip(BMI, hours_exercise))

# Write the data to the CSV file that the next cell reads
with open('BMIdata.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["BMI", "Hours of Exercise"])
    for d in data:
        writer.writerow(d)



In [20]:
import csv
from scipy import stats

# Read data from csv file
with open('BMIdata.csv', 'r') as file:
    reader = csv.reader(file)
    data = list(reader)

# Extract BMI and hours of exercise data
BMI = []
hours_exercise = []
for row in data:
    try:
        BMI.append(float(row[0]))
        hours_exercise.append(int(row[1]))
    except ValueError:
        continue

# Check if arrays have the same length
if len(BMI) != len(hours_exercise):
    raise ValueError("Arrays must have the same length")

# Calculate Pearson correlation coefficient and p-value
pearson_corr, pearson_pvalue = stats.pearsonr(BMI, hours_exercise)

# Calculate Spearman's rank correlation coefficient and p-value
spearman_corr, spearman_pvalue = stats.spearmanr(BMI, hours_exercise)

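# Print the results
print("Pearson correlation coefficient: {:.3f}".format(pearson_corr))
print("Pearson correlation p-value: {:.3f}".format(pearson_pvalue))
print("Spearman correlation coefficient: {:.3f}".format(spearman_corr))
print("Spearman correlation p-value: {:.3f}".format(spearman_pvalue))

# Comparison: both coefficients are negative and close in size, which suggests
# a moderate negative association between BMI and weekly exercise hours that is
# roughly as linear as it is monotonic. A large gap between the two would instead
# point to a nonlinear relationship or to influential outliers.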


Pearson correlation coefficient: -0.406
Pearson correlation p-value: 0.003
Spearman correlation coefficient: -0.415
Spearman correlation p-value: 0.003
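
Pearson and Spearman agree closely in Sol 3, but they can diverge when a relationship is monotonic without being linear. The sketch below uses made-up data (y = exp(x), chosen purely for illustration and unrelated to the BMI data) to show the difference.

```python
import numpy as np
from scipy import stats

# Hypothetical data: y grows exponentially with x,
# so the relationship is strictly increasing but far from linear
x = np.arange(1, 11)
y = np.exp(x)

pearson_r, _ = stats.pearsonr(x, y)
spearman_rho, _ = stats.spearmanr(x, y)

print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_rho:.3f}")
```

Spearman is 1 here because the ranks of x and y match exactly, while Pearson is noticeably lower because a straight line fits the curve poorly.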


In [21]:
# Sol 4:
import csv
import random

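# Generate random placeholder data: TV hours (1-5) and physical activity level (1-10).
# The two columns are drawn independently and no seed is set, so the correlation
# computed from this file changes on every run and is expected to be close to zero.
data = [(random.randint(1, 5), random.randint(1, 10)) for _ in range(50)]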

# Write data to CSV file
with open('TVHours_data.csv', mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['TV Hours', 'Physical Activity'])
    writer.writerows(data)


In [22]:
import pandas as pd
from scipy import stats

# Load data from CSV file
data = pd.read_csv("TVHours_data.csv")

# Extract TV hours and physical activity data
tv_hours = data["TV Hours"]
physical_activity = data["Physical Activity"]

# Calculate Pearson correlation coefficient and p-value
corr, p_value = stats.pearsonr(tv_hours, physical_activity)

# Print results
print("Pearson correlation coefficient:", corr)
print("p-value:", p_value)


Pearson correlation coefficient: -0.17504699333114418
p-value: 0.22403189166333543
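
Because the data for Sol 4 are randomly generated, the numbers above will not be reproduced on a rerun. One way to make such placeholder data repeatable is a seeded generator. This is a minimal sketch: the helper make_data and the seed value 42 are arbitrary choices for illustration, and the seed is not the one behind the output shown above.

```python
import random


def make_data(seed, n=50):
    """Return n (tv_hours, physical_activity) pairs from a seeded generator."""
    rng = random.Random(seed)  # private generator, does not touch global state
    return [(rng.randint(1, 5), rng.randint(1, 10)) for _ in range(n)]


run_a = make_data(seed=42)
run_b = make_data(seed=42)

# The same seed yields the same data
print("Identical runs:", run_a == run_b)
```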


In [24]:
# Sol 5:
'''
To calculate Pearson and Spearman correlation coefficients, we first need to convert the categorical variable "Soft drink preference" into numerical values. 

Here each brand is mapped to an integer code in order of first appearance (Coke = 0, Pepsi = 1, Mountain dew = 2). 
Note that soft drink preference is a nominal variable with no natural order, so these codes are arbitrary: a different assignment would give different coefficients, and the results below should not be read as evidence of a real trend.
'''
import pandas as pd
from scipy import stats

# Create DataFrame
df = pd.DataFrame({'Age': [25, 42, 37, 19, 31, 28],
                   'Soft drink Preference': ['Coke', 'Pepsi', 'Mountain dew',
                                            'Coke', 'Pepsi', 'Coke']})

# Convert soft drink preferences to integer codes (order of first appearance)
drinks = df['Soft drink Preference'].unique()
mapping = {drink: i for i, drink in enumerate(drinks)}
df['Preference Number'] = df['Soft drink Preference'].map(mapping)

# Calculate Pearson correlation coefficient and p-value
corr, p_value = stats.pearsonr(df['Age'], df['Preference Number'])
print('Pearson correlation coefficient:', corr)
print('p-value:', p_value)

# Calculate Spearman's rank correlation coefficient and p-value
spearman_corr, spearman_p_value = stats.spearmanr(df['Age'], df['Preference Number'])
print("Spearman's rank correlation coefficient:", spearman_corr)
print('p-value:', spearman_p_value)


Pearson correlation coefficient: 0.7587035441865058
p-value: 0.08031134942324102
Spearman's rank correlation coefficient: 0.8332380897952965
p-value: 0.03939551647885117
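
An alternative that avoids imposing an order on the brands is one-hot encoding: one 0/1 indicator column per brand, each correlated with age separately (a point-biserial correlation). The sketch below is one possible approach, not the only valid analysis; with six respondents and a single Mountain dew drinker it is illustrative only. The names dummies and results are introduced here.

```python
import pandas as pd
from scipy import stats

# Survey data from Q5
df = pd.DataFrame({'Age': [25, 42, 37, 19, 31, 28],
                   'Soft drink Preference': ['Coke', 'Pepsi', 'Mountain dew',
                                             'Coke', 'Pepsi', 'Coke']})

# One 0/1 indicator column per brand
dummies = pd.get_dummies(df['Soft drink Preference'], dtype=int)

# Point-biserial correlation between each indicator and age
results = {}
for brand in dummies.columns:
    r, p = stats.pointbiserialr(dummies[brand], df['Age'])
    results[brand] = (r, p)
    print(f"{brand}: r = {r:.3f}, p = {p:.3f}")
```

A negative r for a brand means that its drinkers in this sample tend to be younger than the rest.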


In [25]:
# Sol 6:
import csv

# Sample data
sales_calls = [10, 15, 12, 8, 11, 13, 17, 14, 9, 16, 11, 13, 12, 10, 14, 18, 11, 13, 15, 12, 16, 13, 9, 12, 15, 14, 11, 10, 17, 14]
sales_made = [2, 3, 4, 1, 3, 3, 5, 4, 2, 4, 3, 3, 2, 2, 4, 5, 3, 2, 4, 2, 5, 4, 1, 3, 5, 4, 3, 2, 5, 4]

# Write data to CSV file
with open('company_sales.csv', mode='w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Sales Calls', 'Sales Made'])
    writer.writerows(zip(sales_calls, sales_made))


In [27]:
import pandas as pd
from scipy import stats

# Read the data from a CSV file
data = pd.read_csv('company_sales.csv')

# Extract sales calls and sales data
sales_calls = data['Sales Calls']
sales_made = data['Sales Made']

# Calculate Pearson correlation coefficient and p-value
corr, p_value = stats.pearsonr(sales_calls, sales_made)

print("Pearson correlation coefficient:", corr)
print("p-value:", p_value)


Pearson correlation coefficient: 0.8721857593881509
p-value: 3.4231169065409406e-10
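
The p-value reported by pearsonr() tests the null hypothesis of zero correlation. As a rough sketch of where it comes from, the same two-sided p-value can be obtained from the t-statistic t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom. The code reuses the sample data from Sol 6; t_stat and p_manual are names introduced here.

```python
import math
from scipy import stats

# Sample data from Sol 6
sales_calls = [10, 15, 12, 8, 11, 13, 17, 14, 9, 16, 11, 13, 12, 10, 14,
               18, 11, 13, 15, 12, 16, 13, 9, 12, 15, 14, 11, 10, 17, 14]
sales_made = [2, 3, 4, 1, 3, 3, 5, 4, 2, 4, 3, 3, 2, 2, 4,
              5, 3, 2, 4, 2, 5, 4, 1, 3, 5, 4, 3, 2, 5, 4]

n = len(sales_calls)
r, p_scipy = stats.pearsonr(sales_calls, sales_made)

# t-statistic for H0: no linear correlation, with n - 2 degrees of freedom
t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))

# Two-sided p-value from the t distribution's survival function
p_manual = 2 * stats.t.sf(abs(t_stat), df=n - 2)

print(f"r = {r:.4f}, t = {t_stat:.2f}")
print("p (manual):", p_manual)
print("p (SciPy): ", p_scipy)
```

The very small p-value reflects both the strong correlation and the sample size of 30.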
