In [None]:
import pandas as pd
from matplotlib import pyplot as plt
from scipy import stats
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression

Purpose of this notebook:  
1) Conduct a data analysis by following a framework  
2) Answer two business questions by running a correlation test and a linear regression model  
3) Keep the content friendly for data science beginners 

<font size=4>Data Analysis Framework</font>

Following a framework makes data analysis more structured and efficient. The Cross Industry Standard Process for Data Mining (CRISP-DM) is a good one to use. 
![](https://i.pinimg.com/originals/d3/fe/6d/d3fe6d904580fa4e642225ae6d18f0da.jpg "Problem Solving Framework") 

<font size=4>Let's Start!</font>  

**1. Business Issue Understanding**  
Q1: Should we encourage customer reviews in order to get more subscribers? 

**2. Data Understanding**  
Data Needed:  
- The number of customer reviews of all courses (num_reviews)  
- The number of subscribers of all courses (num_subscribers)

**3. Data Preparation**  
Fortunately, this dataset has already been cleaned. 

**4. Analysis Modeling**  
**1) What analysis method should we use?**  
To answer the business question, we need to find out whether there is a positive relationship between num_reviews and num_subscribers.  
Therefore, a correlation test can be used to check their relationship. 
  
**2) What is correlation test?**  
A correlation test is used in data analysis to better understand the relationship between variables. There are three kinds of correlation:  
**Positive Correlation:** both variables change in the same direction   
**Neutral Correlation:** no relationship between the changes in the variables   
**Negative Correlation:** the variables change in opposite directions 
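
To make the three cases concrete, here is a minimal sketch on synthetic data. The arrays below are generated purely for illustration and are not taken from the Udemy dataset; the names `y_positive`, `y_negative` and `y_unrelated` are made up for this example.

```python
import numpy as np

# Synthetic data for illustration only (not from the Udemy dataset)
rng = np.random.default_rng(seed=0)
x = rng.normal(size=500)
noise = rng.normal(size=500)

y_positive = 2.0 * x + noise         # tends to rise when x rises
y_negative = -2.0 * x + noise        # tends to fall when x rises
y_unrelated = rng.normal(size=500)   # generated independently of x

# Correlation coefficient of x with each y
correlations = {
    "positive": np.corrcoef(x, y_positive)[0, 1],
    "negative": np.corrcoef(x, y_negative)[0, 1],
    "neutral": np.corrcoef(x, y_unrelated)[0, 1],
}

for name, r in correlations.items():
    print(f"{name:>8}: r = {r:.2f}")
```

The first coefficient should come out clearly positive, the second clearly negative, and the third close to zero.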

**<font color=#0099ff >3) Run correlation test in Python</font>**

In [None]:
# read data 
courses_df = pd.read_csv('../input/udemy-courses/udemy_courses.csv')
# general info of this data
courses_df.info()

It seems we have perfect data: no null values and nothing messy. Like a daydream! haha~ 

In [None]:
# get the y-axis, y = num_subscribers
temp_df1 = courses_df['num_subscribers']
y1 = temp_df1.values

# get the x-axis, x = num_reviews
# x_reviews
review_df = courses_df['num_reviews']
x1 = review_df.values

# draw scatter plot 
plt.xlabel('num_reviews')
plt.ylabel('num_subscribers')
plt.scatter(x1,y1)
plt.show()

As we can see in the scatter plot above, most points are concentrated where num_reviews is less than 10000.  
To get a clearer picture, we can **filter the data to num_reviews <= 10000**

In [None]:
# get the num_subscribers and num_reviews columns by name
temp_df2 = courses_df[['num_subscribers', 'num_reviews']]

# filter these two columns with num_reviews <= 10000
filter_cols = temp_df2[temp_df2['num_reviews']<=10000]

# get the new x,y axis after filter 
y2 = filter_cols['num_subscribers'].values
x2 = filter_cols['num_reviews'].values

# draw the trendline this time
slope, intercept, r, p, std_err = stats.linregress(x2, y2)

# value on the fitted line for each num_reviews value
mymodel = slope * x2 + intercept

plt.plot(x2, mymodel)

# draw scatter plot again
plt.xlabel('num_reviews')
plt.ylabel('num_subscribers')
plt.scatter(x2,y2)
plt.show()

This looks clearer now! You can try changing the filter condition to see different versions of the scatter plot. 

**5. Validation**   
A scatter plot is a visual way to see the correlation between two variables. To further validate the correlation, we can calculate the value of Pearson’s Correlation.  

**1) What is Pearson’s Correlation?**  
The Pearson’s Correlation coefficient is used to summarise the strength of the linear relationship between two data samples. The value of the coefficient ranges from -1 to 1. A common rule of thumb is:  
**Low correlation:** -0.5 < value < 0.5   
**High positive correlation:** 0.5 <= value <= 1  
**High negative correlation:** -1 <= value <= -0.5  
**No linear correlation:** value = 0  
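
As a rough illustration of what the coefficient measures, the sketch below computes it by hand as the sum of the products of the deviations from the means, divided by the square root of the product of the summed squared deviations, and compares the result with `scipy.stats.pearsonr`. The helper `pearson_r` and the sample numbers are made up for this example.

```python
import numpy as np
from scipy.stats import pearsonr


def pearson_r(x, y):
    """Pearson's correlation computed directly from its definition."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()   # deviations of x from its mean
    dy = y - y.mean()   # deviations of y from its mean
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


# Made-up sample values, for illustration only
reviews = [10, 20, 30, 40, 50]
subscribers = [120, 180, 310, 390, 520]

manual = pearson_r(reviews, subscribers)
library, p_value = pearsonr(reviews, subscribers)

print(f"manual r = {manual:.4f}, scipy r = {library:.4f}")
```

Both values should agree, which shows that `pearsonr` is not doing anything mysterious: it measures how consistently the two variables move away from their means together.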

**<font color=#0099ff >2) Calculate the Pearson’s Correlation in Python</font>**


In [None]:
corr, _ = pearsonr(x2, y2)
f"Pearson's correlation: {corr:.3f}"
# we have a high positive correlation!

**6. Conclusion**  
There is a high positive correlation between num_reviews and num_subscribers. As a business insight, courses with more customer reviews tend to have more subscribers, so encouraging reviews for a new course looks promising. However, correlation alone does not show that reviews cause subscriptions, and to understand the full set of factors that affect the number of subscribers, more variables need to be examined, such as review scores, course prices and course durations.


<font size=4>Extension Questions</font>

**Q2: Can we use the number of customer reviews to predict the future number of subscribers?**  
To answer this question, we can build a linear regression model based on these two variables, then test the validity of this model.  

**Why linear regression model?**  
The purpose of linear regression is to “predict” the value of the dependent variable based on the values of one or more independent variables. From the scatter plot in the correlation test above, we have already seen that there is a roughly linear relationship between the two variables, num_subscribers and num_reviews. Therefore, we can use a linear regression model to predict the dependent variable (the number of subscribers) from the independent variable (the number of reviews).  

**What does linear regression model look like?**  
Dependent Variable = Independent variable * slope + intercept
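
For a single independent variable, the slope and intercept can be computed directly with the ordinary least squares formulas. The sketch below uses made-up numbers (not the Udemy data) and checks the result against `numpy.polyfit`.

```python
import numpy as np

# Made-up sample points, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Ordinary least squares for one independent variable:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
# the fitted line always passes through the point (mean_x, mean_y)
intercept = y.mean() - slope * x.mean()

# Cross-check with NumPy's degree-1 polynomial fit (returns slope first)
np_slope, np_intercept = np.polyfit(x, y, deg=1)

print(f"y = {slope:.2f} * x + {intercept:.2f}")
```

Once the slope and intercept are known, predicting the dependent variable for a new value is just a matter of plugging it into the line.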

<font size=4>Let's Start!</font>  

**1. Set our linear regression model**  
Dependent variable: num_subscribers  
Independent variable: num_reviews  

num_subscribers = num_reviews * slope + intercept

**<font color=#0099ff >2. Build linear regression model in Python</font>**

In [None]:
# prepare x and y variables 
y2 = filter_cols['num_subscribers'].values
x2 = filter_cols['num_reviews'].values.reshape(-1, 1)

# build the regression model
model = LinearRegression().fit(x2, y2)

# round to two decimal places (coef_ is an array with one entry per feature)
model_coef = round(float(model.coef_[0]), 2)
model_intercept = round(float(model.intercept_), 2)

# model expression 
f'y = {model_coef}*x + {model_intercept}'

**3. Validation of the model**  
To validate this model, we can calculate the “p-value” and the “r-squared value” of this model.  

**1) What are the p-value and the r-squared value?**  
Simply put, the p-value indicates whether the relationship described by the model is statistically significant. It is the probability of seeing a relationship at least this strong if there were actually no relationship between the variables. The lower the p-value, the stronger the evidence for a relationship. A p-value below 0.05 is commonly treated as significant.  
The r-squared value measures how well the data is explained by the model. It indicates the proportion of the variability in the dependent variable that is explained by the model. For example, an r-squared of 0.40 indicates that 40% of the variability in the dependent variable is explained by the model. The r-squared value ranges from 0 to 1. As a rule of thumb, an r-squared value above 0.7 is often treated as a strong fit. 
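
The sketch below shows one way to obtain both values for a simple linear regression, using `scipy.stats.linregress` on synthetic data (generated here for illustration, not taken from the Udemy dataset). The p-value it reports tests the null hypothesis that the slope is zero. The r-squared value is also computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only: a linear trend plus random noise
rng = np.random.default_rng(seed=42)
x = rng.uniform(0, 100, size=200)
y = 3.0 * x + 50 + rng.normal(scale=60, size=200)

# linregress returns the slope, intercept, r, p-value and standard error
result = stats.linregress(x, y)

# r-squared by hand: 1 - (unexplained variation / total variation)
predicted = result.slope * x + result.intercept
ss_res = ((y - predicted) ** 2).sum()    # variation left unexplained by the line
ss_tot = ((y - y.mean()) ** 2).sum()     # total variation in y
r_squared = 1 - ss_res / ss_tot

print(f"p-value: {result.pvalue:.3g}")
print(f"r-squared (by hand): {r_squared:.3f}")
print(f"r-squared (rvalue squared): {result.rvalue ** 2:.3f}")
```

For a regression with one independent variable, the r-squared value equals the square of Pearson's correlation coefficient, which is why the last two lines agree.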

**<font color=#0099ff >2) Calculate the p-value and r-squared value in Python</font>**

In [None]:
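# r-square value
r_sq = round(model.score(x2, y2), 3)
print(f'r-square value: {r_sq}')

# p-value of the slope, from a simple linear regression on the same data
p_value = stats.linregress(x2.ravel(), y2).pvalue
print(f'p-value: {p_value:.3g}')

# not very high r-square value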

**4. Conclusion**  
Although there is a statistically significant relationship between num_reviews and num_subscribers, only about 40% of the variability in the dependent variable can be explained by the model, which is not high enough to build a strong model. Therefore, predictions of the future number of subscribers based on num_reviews alone will not be very accurate.


**References:**  
https://rcompanion.org/handbook/G_10.html  
https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/  
https://medium.com/@girish.malekar/knowledge-discovery-process-the-problem-solving-framework-8009fa8582fd  

Dataset source:  
Author: Willden, Chase  
Retrieved May 29, 2020 from [http://theconceptcenter.com/simple-research-study-udemy-courses/].