### **Statistical Intuitions and Applications**

# **Assignment #1**


# **Background**


Imagine that you have graduated with a degree in data science and now work as a consultant. You are hired by a social media analytics company that specializes in optimizing influencer strategies. The company has collected detailed data from **500** social media influencers. This data includes information about the influencer's demographics, educational background, social media activity, and earnings.

The company wants you to analyze this data in ways that can help them design personalized recommendations for influencers to improve their engagement rates, follower growth, and income potential.

In Assignment 1, you will take a random sample of **200** influencers from the 500 individuals who were studied and analyze the data for these 200 influencers.

**Note:** The entire dataset (and descriptions of each of the variables) can be found [here](https://docs.google.com/document/d/1G7H4VSNSMvJMoPR0AhMVEjoVgQmjEcZGUBpkq_ISEjI/edit?usp=sharing).



**Important Information**:   

1. Read all the instructions carefully before you begin!
2. You will need to save the (.ipynb) file as a ***searchable*** PDF and NOT as a picture. Likewise, your answers and your code must be submitted as a searchable PDF. Pictures or snapshots of your work will NOT be accepted.
3. The generated csv file and .ipynb file must be submitted in a zip folder as a secondary source.
4. You may use Jupyter Notebook or Colab, whichever is more convenient for you.

Non-compliance with the above instructions will result in a 0 grade on the relevant portions of the assignment. Your instructor will grade your assignment based on what you submitted. Failure to submit the assignment or submitting an assignment intended for another class will result in a 0 grade, and resubmission will not be allowed. Make sure that you submit your original work. Suspected cases of plagiarism will be treated as potential academic misconduct and will be reported to the College Academic Integrity Committee for a formal investigation. As part of this procedure, your instructor may require you to meet with them for an oral exam on the assignment.


**IMPORTANT**: Run the code below. It will load in the packages that you need to complete the tasks below.

In [2]:
# Load the libraries that are used in the code throughout this assignment.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random
import scipy.stats as stats

**Task 1.**

*   As mentioned above, you will select a random sample of **200** individuals from the company's data set.
*   You will then conduct analyses on this random sample.
*   Look at the code below. To select a random sample from the data, you should replace **Full_Name** with your own full name and **STUDENT_ID** with your own ID number in the code.
*   After you have done so, run the code. The code will generate a csv file with a random sample of 200 participants. The file will be labeled with your name.
*   **REMEMBER:** you need to add this csv file to a zip file along with your .ipynb file when submitting your assignment.

In [1]:
# The code below will generate a random sample of 200 participants for you to analyze.

# You need to replace "Full_Name" in the code below with your own full name.
# You also need to replace STUDENT_ID with your own ID number for reproducibility.
# The code will then generate a csv file that is labeled with your name and contains a random sample of 200 individuals.
# REMEMBER: you need to submit this csv file in the zip folder when submitting your assignment.

import pandas as pd  # imported here as well so this cell runs even if the import cell above was skipped

student_name = "Alyaziya Almansoori"  # Replace "Full_Name" with your actual full name
file_name = f'{student_name}.csv'

try:
    df = pd.read_csv(file_name)  # Read the existing file if it exists
except FileNotFoundError:
    original_data = pd.read_csv("https://raw.githubusercontent.com/ZUCourses/SIA-Public/main/Social_Media_Influencer_Dataset_Final.csv")
    df = original_data.sample(200, random_state=STUDENT_ID)  # Replace STUDENT_ID with your own ID number for reproducibility
    df.to_csv(file_name, index=False)  # Save the sample to a CSV file
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
df.head()
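
To see why a fixed `random_state` matters, here is a minimal sketch on a small made-up table. The `toy` DataFrame, its column names, and the seed `12345` are illustrative assumptions and are not part of the company's data. Sampling twice with the same seed should return the same rows, which is what makes your sample reproducible.

```python
import pandas as pd

# Hypothetical stand-in for the company's dataset (NOT the real data).
toy = pd.DataFrame({
    "influencer_id": range(1, 21),
    "followers": range(1000, 21000, 1000),
})

seed = 12345  # illustrative seed; in the assignment this would be your own ID number

# Draw the "same" sample twice using the same seed.
first_draw = toy.sample(5, random_state=seed)
second_draw = toy.sample(5, random_state=seed)

# With the same seed, both draws contain identical rows in identical order.
same_seed_matches = first_draw.equals(second_draw)
print("Same seed gives the same sample:", same_seed_matches)
print("Rows drawn:", sorted(first_draw["influencer_id"]))
```

If you change the seed, you will generally get a different set of rows, so keep the same ID number every time you run the sampling code.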



**Task 2.** (#Variables)

* Now that you have your dataset, you are ready to start analyzing it!
* The first step is to explore your dataset.
* Once you've done this, imagine you are writing a report for the social media analytics company that hired you.
* Start with a brief introduction to the research question you are exploring, then describe the dataset you are analyzing (e.g., what is the sample you are analyzing? What are the variables?).
* Your audience is the company's leadership team, who will use your insights to inform strategic decisions.

Add the **brief introduction** below.

**Task 2 Answer:**





DOUBLE CLICK TO **ADD YOUR TEXT FOR TASK 2 HERE**

**Task 3.** (#DescriptiveStats)

*   Run the code below. It will randomly select **4** variables from your dataset and then print the names of the four selected variables.
***IMPORTANT: ONLY RUN THIS CODE BLOCK ONCE.***



In [2]:
import random

column_titles = ["Monthly Average Number of Posts Created","Avg Time Spent on Platform (minutes/day)","Monthly Average Number of Likes Given","Monthly Average Number of Comments Received",
"Monthly Average Number of Shares","Number of Followers","Monthly Average Number of Views", "Engagement Rate (%)", "Annual Income in USD","Age"]

selected_columns = random.sample([col for col in column_titles if col not in ["Engagement Rate (%)", "Annual Income in USD"]], 3)

selected_columns.append(random.choice(["Engagement Rate (%)", "Annual Income in USD"]))

print("Selected Variables:", selected_columns)


* Your task is to do the following:

1.   **Create a histogram and generate descriptive statistics for each of the four variables randomly selected above**.
You can use the code provided below to assist you.

2.  For each variable, describe the following aspects of the distribution:
  *   **Shape**
  *   **Center**
  *   **Spread**
  *   **Outliers**
3. Describe and interpret your findings in **context**: How do the aspects of the distribution for each of your variables relate to the scenario under study? In other words, consider how these insights contribute to understanding trends within the context of the data.
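
As a rough illustration of how shape, center, spread, and outliers can be quantified, the sketch below uses synthetic values rather than the assignment data. The series name `toy_variable`, the random seed, and the choice of the 1.5 × IQR rule for flagging outliers are all assumptions made for this example; other outlier rules are possible.

```python
import numpy as np
import pandas as pd

# Synthetic values standing in for one numeric variable (NOT the real data).
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=8, sigma=0.6, size=200), name="toy_variable")

# Center: compare the mean and the median.
mean, median = values.mean(), values.median()

# Spread: standard deviation and interquartile range (IQR).
std = values.std()
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Shape: sample skewness (positive values indicate a longer right tail).
skewness = values.skew()

# Outliers: values beyond 1.5 * IQR from the quartiles (a common rule of thumb).
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"mean={mean:.1f}, median={median:.1f}")
print(f"std={std:.1f}, IQR={iqr:.1f}")
print(f"skewness={skewness:.2f}")
print(f"number of flagged outliers={len(outliers)}")
```

When the mean is noticeably larger than the median and the skewness is positive, the distribution is likely right-skewed, and the median and IQR are usually the safer summaries to report.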

In [None]:
#Sample code:
import matplotlib.pyplot as plt

# Remember: you will need to replace "Column" with the name of the variable that you are visualizing and describing

# Plot the histogram
plt.hist(df["Column"], bins=XX)  # replace XX with the number of bins
plt.title("Column")
plt.xlabel("Column")
plt.ylabel("Frequency")
plt.show()

# Produce descriptive statistics
print("Column")
print(df["Column"].describe())

**Task 3 Answer:**




In [None]:
#Write your code for Task 3 here:

DOUBLE CLICK TO **ADD YOUR TEXT FOR TASK 3 HERE**

**Task 4.** (#Visualizations)

Now that you have described and plotted your data, let's explore whether the data differ for male and female participants.

*   Generate grouped box plots for each of **the 4 variables in Task 3**.  Use the sample code below to assist you.
*   Your boxplot should **compare** the distributions for males and females in your dataset.
*   Afterwards, you should describe what you observe in each case.
*   Ensure you mention the **five-number summaries** for each gender and provide their **interpretation in context**.
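
A five-number summary consists of the minimum, first quartile (Q1), median, third quartile (Q3), and maximum. The sketch below shows one possible way to compute it per gender with `groupby` and `quantile`. The values and the column name `toy_variable` are made up for illustration; only the `Gender` column name mirrors the assignment data.

```python
import pandas as pd

# Made-up values for illustration only (NOT the real data).
toy = pd.DataFrame({
    "Gender": ["Male"] * 5 + ["Female"] * 5,
    "toy_variable": [10, 20, 30, 40, 50, 15, 25, 35, 45, 95],
})

# Five-number summary for each group: min, Q1, median, Q3, max.
five_number = (
    toy.groupby("Gender")["toy_variable"]
    .quantile([0, 0.25, 0.5, 0.75, 1])
    .unstack()
)
five_number.columns = ["min", "Q1", "median", "Q3", "max"]
print(five_number)
```

In this toy example the female maximum (95) lies far above the female Q3 (45), so a box plot would draw it as a separate outlier point rather than as the end of the whisker.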

In [None]:
#Sample Code
import matplotlib.pyplot as plt

# Filter the DataFrame based on Gender
male = df[df["Gender"] == "Male"]
female = df[df["Gender"] == "Female"]

# Replace 'Column' with the actual variable name you want to analyze
data = [male["Column"], female["Column"]]

# Create boxplots side by side
fig, ax = plt.subplots()
ax.boxplot(data)
# Label the boxes after plotting so that boxplot does not reset the labels
ax.set_xticklabels(["Male", "Female"])
ax.set_title("Column")
ax.grid(axis="y")
plt.show()

# Descriptive statistics (min, 25%, 50%, 75%, and max form the five-number summary)
print("Descriptive Statistics for Male Influencers' Column")  # Replace 'Column' with the variable name
print(male["Column"].describe())
print("Descriptive Statistics for Female Influencers' Column")  # Replace 'Column' with the variable name
print(female["Column"].describe())


In [None]:
#Write your code for Task 4.

**Task 4 Answer:**




DOUBLE CLICK TO **ADD YOUR TEXT FOR TASK 4 HERE**

**Task 5.** (#Correlation, #CompProgramDesign)


**Part A**

* Select **TWO appropriate independent variables**, along with **ONE relevant dependent variable**, from Task 3.
* Now, create two scatterplots to show the correlation between each independent variable and the dependent variable (one scatterplot for **each independent** variable).
* For each scatterplot you should visualize the relationship between the independent and dependent variables.
* For each scatterplot, describe and interpret the relationship in terms of:

  * **Form**
  * **Strength**
  * **Direction**
* Describe and interpret your findings **in context**: How do the form, strength, and direction of the relationship between each pair of variables reflect the scenario under study? In other words, consider how these insights help in understanding the relationships between the independent and dependent variables, within the context of the data.

**Part B**

* Investigate whether the relationship observed in each scatterplot varies **by gender**.

**Hint:**
To answer this, you will need to create separate scatterplots for each gender (male and female) to compare how the relationships differ.
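
Alongside the separate scatterplots, it can help to compare the correlation coefficient within each gender. The sketch below does this on synthetic data in which the relationship is deliberately built to be stronger for one group. The column names `x` and `y`, the seed, and the slopes are assumptions for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data (NOT the real data): the relationship is built to differ by group.
rng = np.random.default_rng(1)
n = 100
gender = np.array(["Male", "Female"] * (n // 2))
x = rng.uniform(0, 10, size=n)
noise = rng.normal(0, 1, size=n)
y = np.where(gender == "Male", 2.0 * x, 0.2 * x) + noise

toy = pd.DataFrame({"Gender": gender, "x": x, "y": y})

# Pearson correlation for the whole sample and within each gender.
overall_r = toy["x"].corr(toy["y"])
r_by_gender = {g: grp["x"].corr(grp["y"]) for g, grp in toy.groupby("Gender")}

print("overall r =", round(overall_r, 3))
for g, r in r_by_gender.items():
    print(g, "r =", round(r, 3))
```

A clear gap between the two within-group coefficients, together with visibly different scatterplots, would suggest that the relationship varies by gender.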


In [None]:
#Sample code:
from scipy import stats
import matplotlib.pyplot as plt

# Display the correlation coefficient (pearsonr returns the coefficient first, then the p-value)
corr = stats.pearsonr(df["Column A"], df["Column B"])[0]
print("correlation coefficient =", corr)

# Display the scatterplot
plt.scatter(df["Column A"], df["Column B"])
plt.title("Column A vs Column B")
plt.xlabel("Column A")
plt.ylabel("Column B")
plt.show()

**Task 5 Answer:**




In [None]:
#Write your code for Task 5.

DOUBLE CLICK TO **ADD YOUR TEXT FOR TASK 5 HERE**

**Task 6.** (#Correlation, #Visualizations, #CompProgramDesign)

**Part A**

For this task, consider the same two independent variables and the dependent variable you chose in **Task 5**.

For each independent variable, follow these steps:

* Fit a **simple linear regression model** that predicts the dependent variable you chose based on the selected independent variable.
* Generate, interpret, and use the following to assess the fit of each linear model:
  * **Residual plot**
  * **Standard error**
  * **R² value**
* If the model is a good fit, interpret the **slope** and the **y-intercept** for each independent variable in relation to the dependent variable. (**Note:** If you find that the linear model does not provide a good fit based on these metrics, clearly state this as the reason for not providing the interpretation.)

* Describe and interpret your findings **in context**: How do the residual plot, standard error, R² value, slope, and y-intercept reflect the fit and accuracy of the model in the scenario being studied? In other words, consider how these insights help in understanding the relationship between the independent and dependent variables within the context of the data.
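
To make the fit metrics above more concrete, the following sketch computes the residuals, the standard error, and R² by hand for a simple linear regression. The six data points are made up, and the standard error is taken here to be the residual standard error, sqrt(SSE / (n − 2)), which is an assumption about the intended definition.

```python
import numpy as np

# Made-up data that lie close to a straight line (NOT the real data).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Least-squares slope and intercept of the line y = slope * x + intercept.
slope, intercept = np.polyfit(x, y, deg=1)

# Residuals: observed value minus predicted value.
predicted = slope * x + intercept
residuals = y - predicted

# R^2: share of the variation in y that the line explains.
sse = np.sum(residuals ** 2)            # sum of squared residuals
sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
r_squared = 1 - sse / sst

# Residual standard error: typical size of a residual, using n - 2 degrees of freedom.
standard_error = np.sqrt(sse / (len(x) - 2))

print("slope =", round(slope, 3), "intercept =", round(intercept, 3))
print("R2 =", round(r_squared, 3))
print("SE =", round(standard_error, 3))
```

A good linear fit typically shows residuals scattered randomly around zero with no curve or funnel pattern, a standard error that is small relative to the scale of the dependent variable, and an R² close to 1.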

**Part B**

If in Task 5 you observe that the relationship between the dependent variable and the independent variables differs by gender, then you need to:

* Run the regression model separately for each gender (male and female) and interpret the findings.
* Compare the results for each gender and describe how the relationship changes, if at all.
* **Note:** If you do not find a difference in the relationship between genders, clearly explain why you are not completing Part B of the analysis in Task 6.






In [None]:
#Sample code:
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
import seaborn as sns

def regression_equation(column_x, column_y):
    # Fit the regression line using the "statsmodels" library
    X = sm.add_constant(df[column_x])  # adds the intercept term
    Y = df[column_y]
    regressionmodel = sm.OLS(Y, X).fit()  # OLS stands for "ordinary least squares"
    print("R2:", round(regressionmodel.rsquared, 3))
    SE = np.sqrt(regressionmodel.mse_resid)  # standard error of the residuals
    print("SE =", round(SE, 3))

    # Display the correlation coefficient
    correlation_coefficient = stats.pearsonr(df[column_x], df[column_y])[0]
    print("correlation_coefficient =", round(correlation_coefficient, 3))

    # Extract the regression parameters, rounded to 3 decimal places, and print the regression equation
    intercept = round(regressionmodel.params.iloc[0], 3)
    slope = round(regressionmodel.params.iloc[1], 3)
    print("Regression equation: " + column_y + " =", slope, "* " + column_x + " +", intercept)

    # Display the scatter plot with the line of best fit
    plt.scatter(df[column_x], Y, color="green")
    plt.xlabel(column_x)
    plt.ylabel(column_y)
    plt.plot(df[column_x], regressionmodel.fittedvalues, color="red")
    plt.show()

    # Display the residual plot
    sns.residplot(x=column_x, y=column_y, data=df)
    plt.show()

    # Display the residual plot with dashed lines at +/- one SE
    sns.residplot(x=column_x, y=column_y, data=df)
    plt.axhline(y=SE, color="r", linestyle="--")
    plt.axhline(y=-SE, color="r", linestyle="--")
    plt.show()

regression_equation("Independent Variable", "Dependent Variable")

**Task 6 Answer:**




In [None]:
#Write your code for Task 6.

DOUBLE CLICK TO **ADD YOUR TEXT FOR TASK 6 HERE**