# The Influence of Social Media Usage on Consumer Purchasing Decisions

In [None]:
import pandas as pd
import plotly.express as px
import scipy.stats as stats
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error, r2_score

In [2]:
df = pd.read_csv("Data_606_Project.csv")

## Introduction

## Background and Significance

## About the Dataset

## Data Cleaning / Stratified Sampling

## Exploratory Data Analysis

#### 1. Purchase Frequency and Most Used Platform

#### 2. Purchasing Trends

#### 3. Purchasing to Fit In with Trends

#### 4. Ad Frequency vs Platform Usage Frequency

In [3]:
count_df = df.groupby(['platform_usage_frequency', 'ad_frequency_on_social_media']).size().reset_index(name='count')

fig6 = px.bar(count_df, 
              x="platform_usage_frequency", 
              y="count", 
              color="ad_frequency_on_social_media", 
              title="Ad Frequency vs Platform Usage Frequency",
              labels={"platform_usage_frequency": "Platform Usage Frequency", 
                      "ad_frequency_on_social_media": "Ad Frequency"},
              barmode="stack")
fig6.show()

Users Who Engage Daily See Ads the Most:

The largest group of users falls under "Daily" platform usage. They mostly encounter ads "Occasionally" and "Frequently", with a few seeing ads "Almost every time".

Users Engaging Multiple Times Daily Also Experience Frequent Ads:

A high proportion of this group also reports seeing ads "Frequently" and "Occasionally". Fewer users in this category report ads appearing "Almost every time."

Occasional and Rare Users See Fewer Ads:

Those who use social media "Occasionally" or "Rarely" encounter ads much less frequently. The ad frequency for these groups is mostly "Rarely" and "Occasionally", with very few experiencing frequent ads.
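Because the usage groups differ in size, raw counts in a stacked bar chart can hide how ad exposure is distributed within each group. The following is a minimal sketch, assuming a hypothetical toy frame rather than the survey data, of how within-group proportions could be computed with pd.crosstab. The column names match the notebook; the rows are invented for illustration.

```python
import pandas as pd

# Hypothetical toy responses (NOT the survey data from Data_606_Project.csv).
toy = pd.DataFrame({
    "platform_usage_frequency": ["Daily", "Daily", "Daily", "Daily", "Rarely", "Rarely"],
    "ad_frequency_on_social_media": ["Frequently", "Frequently", "Occasionally",
                                     "Almost every time", "Rarely", "Occasionally"],
})

# Row-normalised contingency table: each row sums to 1, so the values are
# the share of each ad-frequency answer within one usage group.
share = pd.crosstab(
    toy["platform_usage_frequency"],
    toy["ad_frequency_on_social_media"],
    normalize="index",
)
print(share.round(2))
```

Reading the table row by row makes statements such as "this group mostly sees ads frequently" directly comparable across groups of different sizes.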

#### 5. Trustworthiness of Ads

In [None]:
age_group_order = ["Below 18", "18 - 36", "37 - 54", "Above 55"]

fig7 = px.box(df, 
              x="age_group", 
              y="trustworthiness_of_social_media_ads", 
              title="Trustworthiness of Ads by Age Group",
              labels={"age_group": "Age Group", 
                      "trustworthiness_of_social_media_ads": "Trust Level"},
              category_orders={"age_group": age_group_order})
fig7.show()

Younger Users (Below 18) Trust Ads More:

The median trust level is higher than in the other age groups, at around 3 to 4. The interquartile range (IQR) is wider, indicating more varied opinions within this group. Some users even rated trustworthiness at the highest level (5).

Middle Age Groups (18-36, 37-54) Show Skepticism:

Their median trust levels are lower (around 2 to 3). The IQR is narrower, meaning most individuals in this range have similar views on ad trustworthiness. A few outliers suggest that some people trust ads more than their peers.

Older Adults (Above 55) Have the Lowest Trust in Ads:

The median trust level is around 2, showing significant skepticism. The IQR extends from 1 to 3, meaning most users in this group rate ad trustworthiness quite low. There are no extreme outliers, suggesting a general consensus among older users.
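The medians and IQRs read off the box plot can also be computed numerically. This is a minimal sketch on a hypothetical toy frame (the ratings are invented, not the survey responses); the quartiles here use pandas' default linear interpolation, so they may differ slightly from the quartiles Plotly draws.

```python
import pandas as pd

# Hypothetical toy ratings (NOT the survey data); trust is on a 1-5 scale.
toy = pd.DataFrame({
    "age_group": ["Below 18"] * 4 + ["Above 55"] * 4,
    "trustworthiness_of_social_media_ads": [2, 3, 4, 5, 1, 2, 2, 3],
})

grouped = toy.groupby("age_group")["trustworthiness_of_social_media_ads"]

# Median, first and third quartile, and IQR for each age group.
summary = pd.DataFrame({
    "median": grouped.median(),
    "q1": grouped.quantile(0.25),
    "q3": grouped.quantile(0.75),
})
summary["iqr"] = summary["q3"] - summary["q1"]
print(summary)
```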

### Guiding Question 1 -> Estimating Population Parameters

### Guiding Question 2 -> Impact of Social Media Usage, Age Group, and Ad Frequency on Unplanned Purchases

### Guiding Question 3 -> Predicting consumer spending using social media factors

We aimed to predict consumer spending behavior based on various social media factors. The variables considered for prediction included:

- content_type_influences_purchasing
- limited_time_offer_purchases
- comparison_factor
- peer_influence_on_purchasing
- trustworthiness_of_social_media_ads

The target variable, which represents whether social media increases spending, was labeled as social_media_increases_spending.

To prepare the data for model training, we employed One-Hot Encoding on the predictor variables to convert categorical features into a numerical format suitable for regression models. The encoding was performed using the pd.get_dummies() function with the drop_first=True argument to avoid multicollinearity, which ensures that we don’t include redundant information.
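To illustrate the effect of drop_first=True, here is a minimal sketch on a hypothetical toy frame. The category values "Price", "Reviews", and "Brand" are assumptions made up for this example and are not taken from the survey.

```python
import pandas as pd

# Hypothetical toy predictors (NOT the survey data): one categorical column
# and one numeric column that get_dummies passes through unchanged.
toy = pd.DataFrame({
    "comparison_factor": ["Price", "Reviews", "Brand", "Price"],
    "trustworthiness_of_social_media_ads": [3, 4, 2, 5],
})

# Full one-hot encoding: one indicator column per category.
full = pd.get_dummies(toy)

# drop_first=True removes the first category's indicator ("Brand" here),
# which becomes the implicit baseline level.
reduced = pd.get_dummies(toy, drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

With all three indicators present, the columns always sum to 1 and duplicate the intercept of a linear model; dropping one level removes that redundancy.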

**Model 1: Linear Regression**

We first tested a Linear Regression model to evaluate the relationship between the predictors and the target variable. The model was trained on 80% of the data and assessed on the remaining 20% hold-out test set using the following metrics:

- R² score: 0.5495
- Mean Squared Error (MSE): 0.584

In [27]:
X = df[['content_type_influences_purchasing', 
        'limited_time_offer_purchases',
        'comparison_factor',
        'peer_influence_on_purchasing',
        'trustworthiness_of_social_media_ads']]

y = df['social_media_increases_spending']

X_encoded = pd.get_dummies(X, drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

print("R² score:", r2_score(y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))

R² score: 0.5495285069904812
Mean Squared Error: 0.5839907444001519


Next, we performed cross-validation on the linear regression model to evaluate its stability and generalization performance.

In [None]:
# Use the one-hot encoded predictors, as in the train/test split above
cv_scores_linear = cross_val_score(model, X_encoded, y, cv=5, scoring='r2')

print("R² scores for each fold:", cv_scores_linear)
print(f"Average R² from cross-validation: {np.mean(cv_scores_linear):.4f}")

R² scores for each fold: [ 0.50474292 -0.00891203  0.64863617  0.48683517  0.41121492]
Average R² from cross-validation: 0.4085


The average R² from cross-validation was 0.4085, which is lower than the hold-out test R² of 0.5495. The fold scores range from -0.0089 to 0.6486, indicating that the model’s performance varies considerably across different subsets of the data.
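One fold has a slightly negative R², which can look surprising. As a minimal sketch with hypothetical numbers (not the survey data), the definition R² = 1 - SS_res / SS_tot shows that a model scores 0 when it does no better than predicting the mean, and below 0 when it does worse. The helper r2_manual is introduced here only for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical target values, chosen only to illustrate the formula.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])


def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (illustrative helper, not from the notebook)."""
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


predictions = {
    "close fit": np.array([1.1, 1.9, 3.2, 3.8, 5.0]),
    "always the mean": np.full_like(y_true, y_true.mean()),
    "reversed order": np.array([5.0, 4.0, 3.0, 2.0, 1.0]),
}

# The manual value and scikit-learn's r2_score should agree for each case.
for name, y_pred in predictions.items():
    print(name, r2_manual(y_true, y_pred), r2_score(y_true, y_pred))
```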

**Model 2: Decision Tree Regressor**

Next, we tested a Decision Tree Regressor, which is a non-linear model that can capture complex relationships between the features and the target variable. The results from this model were as follows:

- R² score: 0.3504
- Mean Squared Error (MSE): 0.8421

In [30]:
dt_model = DecisionTreeRegressor(random_state=42)
dt_model.fit(X_train, y_train)
y_pred = dt_model.predict(X_test)

print("R² score:", r2_score(y_test, y_pred))
print("Mean Squared Error:", mean_squared_error(y_test, y_pred))

R² score: 0.3504273504273505
Mean Squared Error: 0.8421052631578947


We also performed cross-validation on the decision tree model.

In [15]:
cv_scores_dt = cross_val_score(dt_model, X_encoded, y, cv=5, scoring='r2')

print("R² scores for each fold:", cv_scores_dt)
print(f"Average R² from cross-validation: {np.mean(cv_scores_dt):.4f}")

R² scores for each fold: [-0.16634273  0.08604336  0.41135972  0.55555556  0.55371901]
Average R² from cross-validation: 0.2881


The linear regression model was the more effective of the two at predicting consumer spending from social media factors, although its cross-validated performance was still modest (average R² of 0.4085, with one fold near zero). The decision tree model performed worse on both the hold-out test set (R² of 0.3504) and cross-validation (average R² of 0.2881), suggesting it is not well-suited for this particular dataset.
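The averages alone hide how much the fold scores vary. As a minimal sketch, the per-fold R² values printed above can be summarised by their mean, sample standard deviation, and minimum. The arrays are copied from the reported outputs; nothing here is recomputed from the data.

```python
import numpy as np

# Fold scores copied from the cross-validation outputs reported above.
linear_folds = np.array([0.50474292, -0.00891203, 0.64863617, 0.48683517, 0.41121492])
tree_folds = np.array([-0.16634273, 0.08604336, 0.41135972, 0.55555556, 0.55371901])

for name, folds in [("Linear regression", linear_folds), ("Decision tree", tree_folds)]:
    # ddof=1 gives the sample standard deviation across the five folds.
    print(f"{name}: mean={folds.mean():.4f}, "
          f"std={folds.std(ddof=1):.4f}, min={folds.min():.4f}")
```

With only five folds the spread is a rough indicator, so the comparison between the two models should be read with that uncertainty in mind.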

### Guiding Question 4 -> Are influencer promotions more effective than brand advertisements in influencing purchases?

### Guiding Question 5 -> Association Between Ad Frequency and Platform Usage Frequency

Null Hypothesis: Ad frequency is independent of platform usage frequency.

Alternative Hypothesis: Ad frequency is dependent on platform usage frequency.

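To test the hypothesis, we label-encoded both variables, ad_frequency_on_social_media and platform_usage_frequency, and ran an independent two-sample t-test (stats.ttest_ind) on the encoded values. This test assesses whether the mean encoded values of the two columns differ significantly. Note that LabelEncoder assigns integer codes alphabetically rather than by ordinal level, and that a t-test compares means rather than testing independence directly; a chi-square test of independence on the contingency table of the two variables is the standard test for this pair of hypotheses.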

In [29]:
df_copy_test=df.copy()
label_encoder = LabelEncoder()
df_copy_test['ad_frequency_on_social_media'] = label_encoder.fit_transform(df['ad_frequency_on_social_media'])
df_copy_test['platform_usage_frequency'] = label_encoder.fit_transform(df['platform_usage_frequency'])
t_stat, p_value = stats.ttest_ind(df_copy_test["ad_frequency_on_social_media"], df_copy_test["platform_usage_frequency"])
print(f"P-value: {p_value}")

P-value: 0.00021043509448477227


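Since the p-value (about 0.0002) is less than 0.05, we reject the null hypothesis at the 5% significance level. We interpret this as ad frequency being dependent on platform usage frequency, indicating a relationship between the two variables. Strictly, the t-test shows only that the mean encoded values of the two columns differ, so this conclusion is best confirmed with a chi-square test of independence.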
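For reference, a chi-square test of independence on two categorical variables could be sketched as follows. This is an illustration on a hypothetical toy frame with invented counts, not a result from the survey data; it uses pd.crosstab to build the contingency table and scipy.stats.chi2_contingency to test it.

```python
import pandas as pd
from scipy import stats

# Hypothetical toy responses (NOT the survey data): 80 respondents,
# two usage levels and two ad-frequency levels.
toy = pd.DataFrame({
    "platform_usage_frequency": ["Daily"] * 40 + ["Rarely"] * 40,
    "ad_frequency_on_social_media": (["Frequently"] * 30 + ["Rarely"] * 10
                                     + ["Frequently"] * 10 + ["Rarely"] * 30),
})

# Observed counts for every combination of the two variables.
contingency = pd.crosstab(
    toy["platform_usage_frequency"],
    toy["ad_frequency_on_social_media"],
)

# chi2_contingency compares observed counts with the counts expected under
# independence; for a 2x2 table it applies Yates' continuity correction by default.
chi2, p_value, dof, expected = stats.chi2_contingency(contingency)

print(contingency)
print(f"chi2={chi2:.2f}, dof={dof}, p-value={p_value:.6f}")
```

Unlike the t-test on label-encoded values, this test works on the category counts directly, so it does not depend on how the categories are numbered.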

## Conclusion and Future Scope

- Surveys and statistical analysis, as discussed during lectures, were used to collect and interpret data, supporting our hypothesis.

- The t-test on the encoded variables gave a p-value below 0.05 (about 0.0002), so we rejected the null hypothesis and concluded that ad frequency is dependent on platform usage frequency.

- Linear regression (average cross-validated R²: 0.4085) predicts consumer spending better than the decision tree regressor (average cross-validated R²: 0.2881).

- Influencer promotions are more effective than brand advertisements in influencing purchases, as verified by the decision tree model and OLS regression.

## References

[1] Google, "The Influence of Social Media Usage on Consumer Purchasing Decisions," Google Forms. [Online]. Available: https://docs.google.com/forms/d/e/1FAIpQLSfN6FVKhkTBqA_7D8FMbNO17M7GtQrUU5yqwmsBgPKvUTVmoA/viewform. [Accessed: 15-Jan-2025].

[2] Nielsen, J. and Budiu, R., "Survey Best Practices," Nielsen Norman Group, 2012. [Online]. Available: https://www.nngroup.com/articles/survey-best-practices/. [Accessed: 15-Jan-2025].

[3] Wikipedia contributors, "Binomial distribution," Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Binomial_distribution. [Accessed: 15-Jan-2025].

[4] Wikipedia contributors, "Poisson distribution," Wikipedia, The Free Encyclopedia. [Online]. Available: https://en.wikipedia.org/wiki/Poisson_distribution. [Accessed: 15-Jan-2025].

[5] DASCA, "What is statistical modelling in data science," DASCA - World of Data Science. [Online]. Available: https://www.dasca.org/world-of-data-science/article/what-is-statistical-modeling-in-data-science. [Accessed: 20-Jan-2025].