<a href="https://colab.research.google.com/github/talamo13/Intro-To-Data-Science-Assignments/blob/Udemy-Courses/Udemy_Course.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Udemy Course Exploration**

##**Context**

Udemy is a massive online open course (MOOC) platform that offers both free and paid courses. Anybody can create a course, a business model by which allowed Udemy to have hundreds of thousands of courses.

##**About Dataset**

This dataset contains 3,673 records of courses from 4 subjects (Business Finance, Graphic Design, Musical Instruments, and Web Design) taken from Udemy

A total of 12 variables are provided as listed below:

| Variable Name(s)  | Description                                     |
|-------------------|-------------------------------------------------|
| course_id         | id field for courses                            |
| course_title      | title field for courses                         |
| url               | url field to course page                        |
| is_paid           | True for Paid / False for Free                  |
| price             | price field for course fee                      |
| num_subscribers   | demand field for each courses                   |
| num_reviews       | review number for each courses                  |
| num_lectures      | lecture per course                              |
| level             | course level by trainee experience              |
| content_duration  | course duration in hours                        |
| published_timestamp | timestamp field for publication              |
| subject           | course type field                               |

A snippet of the data is as follows:


In [2]:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/talamo13/Intro-To-Data-Science-Assignments/Udemy-Courses/Udemy-Courses-Data.csv')
df.iloc[0:5]

Unnamed: 0,course_id,course_title,url,is_paid,price,num_subscribers,num_reviews,num_lectures,level,content_duration,published_timestamp,subject
0,1070968,Ultimate Investment Banking Course,https://www.udemy.com/ultimate-investment-bank...,True,200,2147,23,51,All Levels,1.5,2017-01-18T20:58:58Z,Business Finance
1,1113822,Complete GST Course & Certification - Grow You...,https://www.udemy.com/goods-and-services-tax/,True,75,2792,923,274,All Levels,39.0,2017-03-09T16:34:20Z,Business Finance
2,1006314,Financial Modeling for Business Analysts and C...,https://www.udemy.com/financial-modeling-for-b...,True,45,2174,74,51,Intermediate Level,2.5,2016-12-19T19:26:30Z,Business Finance
3,1210588,Beginner to Pro - Financial Analysis in Excel ...,https://www.udemy.com/complete-excel-finance-c...,True,95,2451,11,36,All Levels,3.0,2017-05-30T20:07:24Z,Business Finance
4,1011058,How To Maximize Your Profits Trading Options,https://www.udemy.com/how-to-maximize-your-pro...,True,200,1276,45,26,Intermediate Level,2.0,2016-12-13T14:57:18Z,Business Finance


##**Assignment 3**

**Statistical Inference for One Sample Proportion and Mean**

This assignment is on inferential statistics. Use SPSS to conduct the analysis and complete each of the following questions. As appropriate, copy the SPSS output and paste it into the correct part below. For problems that require a written response, type the answer below.

**NOTE: Delete the questions you are not answering.**

###**Last Name: A-L**

####**Question 1**

Construct and interpret the 95% confidence interval for the population proportion of all Udemy courses that have the subject area in business finance. Write up the solution using the PMACC procedure.

In [7]:
import numpy as np
from scipy import stats

# Load your dataset into a Pandas DataFrame
# Assuming your DataFrame is named df and contains the relevant columns
# Replace 'df' with the name of your actual DataFrame

# Assuming 'subject_area' is the column containing the subject area of Udemy courses
# 'business_finance' is the category indicating courses in business finance
# Replace 'subject_area' and 'business_finance' with your actual column and category names

# Filter the DataFrame to include only courses in business finance
business_finance_courses = df[df['subject'] == 'business_finance']

# Calculate the sample proportion
sample_proportion = len(business_finance_courses) / len(df)

# Sample size
n = len(df)

# Significance level (alpha)
alpha = 0.05

# Calculate the standard error of the proportion (SE)
SE = np.sqrt((sample_proportion * (1 - sample_proportion)) / n)

# Find the z-value for a 95% confidence interval
z_value = stats.norm.ppf(1 - alpha/2)

# Calculate the margin of error (ME)
ME = z_value * SE

# Calculate the confidence interval (CI)
confidence_interval = (sample_proportion - ME, sample_proportion + ME)

# Write up the solution
print("PMACC Procedure:")
print("Data Summary:")
print("Sample Proportion (p̂):", sample_proportion)
print("Successes:", 1194)
print("Sample Size (n):", n)
print("Significance Level (α):", alpha)
print("\nConfidence Interval:")
print("95% Confidence Interval (CI):", confidence_interval)

PMACC Procedure:
Data Summary:
Sample Proportion (p̂): 0.0
Successes: 1194
Sample Size (n): 3673
Significance Level (α): 0.05

Confidence Interval:
95% Confidence Interval (CI): (0.0, 0.0)


####**Question 2**

Is there convincing evidence that the population mean course fee is more than $50? Use α=0.1. Write up the solution using the PMACC procedure.

In [9]:
# Extract the column containing the course fees
course_fees_column = df['price']

# Calculate the sample mean and standard deviation
sample_mean = course_fees_column.mean()
sample_std = course_fees_column.std(ddof=1)  # Use ddof=1 for sample standard deviation

# Sample size
n = len(course_fees_column)

# Hypothesized population mean
mu_0 = 50

# Significance level (alpha)
alpha = 0.1

# Calculate the t-statistic
t_statistic = (sample_mean - mu_0) / (sample_std / (n ** 0.5))

# Degrees of freedom
df = n - 1

# Find the critical t-value for a one-tailed test with alpha=0.1 and df degrees of freedom
critical_t_value = stats.t.ppf(1 - alpha, df)

# Compare the t-statistic with the critical t-value
if t_statistic > critical_t_value:
    decision = "Reject the null hypothesis. There is convincing evidence that the population mean course fee is more than $50."
else:
    decision = "Fail to reject the null hypothesis. There is not enough evidence to conclude that the population mean course fee is more than $50."

# Write up the solution
print("PMACC Procedure:")
print("Hypotheses:")
print("H0: The population mean course fee is $50 or less.")
print("H1: The population mean course fee is more than $50.")
print("\nData Summary:")
print("Sample Mean (x̄):", sample_mean)
print("Sample Standard Deviation (s):", sample_std)
print("Sample Size (n):", n)
print("Hypothesized Population Mean (μ0):", mu_0)
print("Significance Level (α):", alpha)
print("\nTest Statistic:")
print("t-statistic:", t_statistic)
print("\nDegrees of Freedom:")
print("df:", df)
print("\nCritical t-value:")
print("critical t-value:", critical_t_value)

PMACC Procedure:
Hypotheses:
H0: The population mean course fee is $50 or less.
H1: The population mean course fee is more than $50.

Data Summary:
Sample Mean (x̄): 66.04955077593247
Sample Standard Deviation (s): 61.02793429320322
Sample Size (n): 3673
Hypothesized Population Mean (μ0): 50
Significance Level (α): 0.1

Test Statistic:
t-statistic: 15.938398287566057

Degrees of Freedom:
df: 3672

Critical t-value:
critical t-value: 1.2817821592966705


####**Question 3**

Generate a paragraph of at least 100 words to address one of the following questions

a) Discuss how analyzing your chosen data set using statistical methods could help you become better prepared for future courses in your major.

b) Discuss how analyzing your chosen data set using statistical methods could be instrumental in becoming better prepared for your future career.

###**Last Name: M-Z**

####**Question 1**

Construct and interpret the 90% confidence interval for the population mean course fee. Write up the solution using the PMACC procedure.

In [11]:
# Calculate the sample mean and standard deviation
sample_mean = course_fees_column.mean()
sample_std = course_fees_column.std(ddof=1)  # Use ddof=1 for sample standard deviation

# Sample size
n = len(course_fees_column)

# Significance level (alpha)
alpha = 0.10

# Calculate the standard error of the mean (SE)
SE = sample_std / (n ** 0.5)

# Degrees of freedom
df = n - 1

# Find the t-value for a 90% confidence interval and df degrees of freedom
t_value = stats.t.ppf(1 - alpha/2, df)

# Calculate the margin of error (ME)
ME = t_value * SE

# Calculate the confidence interval (CI)
confidence_interval = (sample_mean - ME, sample_mean + ME)

# Write up the solution
print("PMACC Procedure:")
print("Data Summary:")
print("Sample Mean (x̄):", sample_mean)
print("Sample Standard Deviation (s):", sample_std)
print("Sample Size (n):", n)
print("Significance Level (α):", alpha)
print("\nConfidence Interval:")
print("90% Confidence Interval (CI):", confidence_interval)

PMACC Procedure:
Data Summary:
Sample Mean (x̄): 66.04955077593247
Sample Standard Deviation (s): 61.02793429320322
Sample Size (n): 3673
Significance Level (α): 0.1

Confidence Interval:
90% Confidence Interval (CI): (64.39280816615695, 67.706293385708)


####**Question 2**

Is there convincing evidence that the population proportion of all insured who were smokers is more than 30%? Use α=0.05. Write up the solution using the PMACC procedure

####**Question 3**

Generate a paragraph of at least 100 words to address one of the following questions:

a) Discuss how analyzing your chosen data set using statistical methods could help you become better prepared for future courses in your major.

b) Discuss how analyzing your chosen data set using statistical methods could be instrumental in becoming better prepared for your future career.