# Project: Investigate Udemy Dataset

### Table of Contents

**In this notebook:**

* [Introduction](#Introduction)
* [Data Wrangling](#Data_Wrangling)
* [Exploratory Data Analysis](#Exploratory_Data_Analysis)
* [Conclusions](#Conclusions)

## Introduction

## About Udemy Dataset

#### This dataset contains more than 3,000 entries describing the courses and subjects offered on Udemy. We investigate it to answer questions about subscribers and course prices.

### Based on the dataset, we will work on answering the following questions:

* How many courses does Udemy offer for each subject?
* Which subject has the maximum number of courses?
* How many free courses are there for every subject?
* How many paid courses are there for every subject?
* What are the top-selling courses?
* Which courses were published in 2015?
* What is the maximum number of subscribers for a course in each subject?
* Is there a relationship between the number of lectures and the number of subscribers?
* Is there a relationship between the price of a course and the number of subscribers?
* What kind of relationship is there between price and number of lectures?

In [None]:
# Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
# load Dataset

df=pd.read_csv('../input/udemy-courses/udemy_courses.csv')

## Data_Wrangling

In [None]:
df.info()

In [None]:
df.head(5)

In [None]:
df.tail()

In [None]:
# Count duplicated rows; .all() would only tell us whether every row is a duplicate

df.duplicated().sum()

### Observations

1. The data consists of 3678 rows and 11 columns.
2. There are no null values in the data.
3. There are no duplicated rows in the data.
4. Data types vary among int, object, and bool.
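The duplicate check above can be illustrated on a small frame. This is a minimal sketch with made-up rows (the `toy` frame is hypothetical, not the Udemy data); it shows why `.any()` or `.sum()` is the right reduction for "are there duplicates?", while `.all()` answers a different question.

```python
import pandas as pd

# Toy frame (hypothetical data): the third row repeats the second one.
toy = pd.DataFrame({
    "course_id": [1, 2, 2, 3],
    "subject": ["Web Development", "Graphic Design",
                "Graphic Design", "Business Finance"],
})

dup_mask = toy.duplicated()             # True for each row that repeats an earlier row
all_duplicated = bool(dup_mask.all())   # True only if every row is a duplicate
has_duplicates = bool(dup_mask.any())   # True if at least one row is a duplicate
n_duplicates = int(dup_mask.sum())      # number of duplicated rows

# Dropping the repeats keeps the first occurrence of each row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)
print(all_duplicated, has_duplicates, n_duplicates, len(deduplicated))
```

If `n_duplicates` is greater than zero on the real data, `drop_duplicates()` would be a reasonable next step before counting courses.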

## Data Cleaning

**First**, I changed the published_timestamp column data type from string to datetime.

In [None]:
# Change the published_timestamp column data type from string to datetime

df['published_timestamp']=pd.to_datetime(df['published_timestamp'])

**Second**, I created a new column holding only the year values, to answer a related question.

In [None]:
# Create a new column with the year values only, to answer related questions

df['Year']=df['published_timestamp'].dt.year
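As a small illustration of the two steps above, the sketch below parses timestamps and extracts the year on a hypothetical frame. The ISO 8601 strings and course titles are assumptions made for the example; the real column may be formatted differently.

```python
import pandas as pd

# Hypothetical timestamps and titles, used only to illustrate the pattern.
toy = pd.DataFrame({
    "course_title": ["Course A", "Course B", "Course C"],
    "published_timestamp": [
        "2015-03-01T10:00:00Z",
        "2017-07-04T21:07:00Z",
        "2015-11-20T08:30:00Z",
    ],
})

# Parse the strings into datetimes, then keep only the year component.
toy["published_timestamp"] = pd.to_datetime(toy["published_timestamp"])
toy["Year"] = toy["published_timestamp"].dt.year

# The year column makes per-year filtering a simple comparison.
titles_2015 = toy.loc[toy["Year"] == 2015, "course_title"].tolist()
print(titles_2015)
```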

**Third**, I changed the price column data type to int. To do that, I first had to replace the value 'Free' in the price
column with 0 and then convert the column to int.

In [None]:
df['price']=df['price'].replace({'Free':0})  # limit the replacement to the price column

In [None]:
df['price']=pd.to_numeric(df['price'])
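The conversion can be sketched on a few made-up values. The `raw` frame below is hypothetical; the use of `errors="coerce"` is an optional safeguard assumed here, not something the notebook does, and it turns any other unexpected label into `NaN` so it can be counted rather than raising an error.

```python
import pandas as pd

# Hypothetical raw values: prices stored as strings, with 'Free' for free courses.
raw = pd.DataFrame({"price": ["Free", "20", "200", "Free", "45"]})

# Replace the 'Free' label in the price column only, then convert to numbers.
clean_price = pd.to_numeric(raw["price"].replace("Free", "0"), errors="coerce")

n_unparsed = int(clean_price.isna().sum())  # labels that could not be converted
n_free = int((clean_price == 0).sum())      # free courses after cleaning
print(clean_price.tolist(), n_unparsed, n_free)
```

Checking `n_unparsed` before continuing helps confirm that 'Free' was the only non-numeric label in the column.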

In [None]:
# check for final results

df.info()

## Exploratory_Data_Analysis

Now that the data is clean, let us start our investigation of the above questions. But, first, let's check data statistics and plot some histograms to get an overview of the distribution of different variables and then let's create a heatmap to see how different variables are correlated.

In [None]:
df.describe()

In [None]:
df.hist(column=['num_subscribers','num_reviews','num_lectures'],facecolor='g',figsize=(16,24),layout=(5,2),alpha=0.75,bins=100);

## Observations

* The number of lectures per course is mostly concentrated between 0 and 100, with very few courses between 100 and 300.
* The number of subscribers per course is mostly between 0 and 20000, with very few courses between 30000 and 50000.
* The number of reviews per course is concentrated between 0 and 1000, with very few courses between 1000 and 5000.
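Observations like these, read off a histogram, can be backed by a few summary numbers. The sketch below uses hypothetical lecture counts with a long right tail (not the real column) to show the share of values under a cutoff, the median, and the skewness.

```python
import pandas as pd

# Hypothetical lecture counts with a long right tail, similar in shape to the histogram.
lectures = pd.Series([5, 12, 20, 25, 30, 40, 55, 80, 150, 300], name="num_lectures")

share_up_to_100 = (lectures <= 100).mean()  # fraction of courses with at most 100 lectures
median_lectures = lectures.median()         # robust "typical" value for skewed data
skewness = lectures.skew()                  # positive for a right-skewed distribution

print(share_up_to_100, median_lectures, skewness)
```

On the real data, the same three lines applied to `df['num_lectures']`, `df['num_subscribers']`, and `df['num_reviews']` would quantify how concentrated each distribution is.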

In [None]:
Matrix=df[['is_paid','num_subscribers','num_reviews','num_lectures','Year','price']].corr()
Matrix

In [None]:
plt.figure(figsize=(10,5));
sns.heatmap(Matrix,annot=True,cmap='coolwarm',linewidth=1.5);

## Observations

1. Strong correlations:

* There are no strong correlations between variables.

2. Moderate correlations:

* num_subscribers and num_reviews
* num_lectures and num_reviews
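Reading "strong" and "moderate" off a heatmap depends on the thresholds used. The sketch below labels each pair in a correlation matrix using cutoffs of 0.3 and 0.7; these cutoffs are a common rule of thumb assumed here, not values taken from the notebook, and `label_strength` and the `toy` frame are hypothetical.

```python
import pandas as pd

def label_strength(r, moderate=0.3, strong=0.7):
    """Label a Pearson coefficient by absolute size (thresholds are assumptions)."""
    size = abs(r)
    if size >= strong:
        return "strong"
    if size >= moderate:
        return "moderate"
    return "weak"

# Toy data (hypothetical): b is an exact multiple of a, c is only loosely related.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [3, 1, 4, 2, 5],
})
corr = toy.corr()

# Visit each pair of columns once (upper triangle of the matrix).
pairs = {}
cols = list(corr.columns)
for i, left in enumerate(cols):
    for right in cols[i + 1:]:
        pairs[(left, right)] = label_strength(corr.loc[left, right])

print(pairs)
```

With these assumed cutoffs, coefficients such as 0.15 and 0.051 fall in the "weak" band and 0.33 falls in the "moderate" band.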


## Questions

### Q1. How many courses does Udemy offer for each subject?

In [None]:
count_courses=df.groupby(['subject'])['course_id'].count()
count_courses

In [None]:
count_courses.plot.bar(figsize=(8,5),color='blue');

Udemy offers courses in four subjects:

1. Business Finance
2. Graphic Design
3. Musical Instruments
4. Web Development

### Q2. Which subject has the maximum number of courses?

In [None]:
Maximum_courses=df.groupby(['subject'])['course_id'].count().sort_values(ascending=False)
colors=['#e377c2', '#7f7f7f', '#bcbd22', '#17becf']
plt.figure(figsize=(10,5));
Maximum_courses.plot.bar(color=colors,zorder=3);
plt.title('Number of courses per subject',size=18);
plt.ylabel('No. of courses',size=18);
plt.xlabel('Subjects',size=18);
plt.grid(zorder=0)

In [None]:
Maximum_courses.head(1)

#### The subject with the maximum number of courses is Web Development

### Q3. How many free courses are there for every subject?

In [None]:
Free_courses=df[df['price']==0].groupby(['subject'])['course_id'].count()
Free_courses

In [None]:
sns.set_style("white")  # set the style before plotting so it applies to this figure
Free_courses.plot.line(figsize=(10,5),markersize=15,marker='o',label='Free courses')
plt.grid()
plt.legend(fontsize=12);

The counts of free courses for each subject are:

1. Business Finance: 96
2. Graphic Design: 35
3. Musical Instruments: 46
4. Web Development: 133

### Q4. How many paid courses are there for every subject?

In [None]:
# 'Free' was replaced with 0 during cleaning, so paid courses are those with price > 0
Paid_courses=df[df['price']>0].groupby(['subject'])['course_id'].count()
Paid_courses

In [None]:
Paid_courses.plot.pie(figsize=(10,5),autopct='%0.f%%',explode=[0.05,0.05,0.05,0.05],radius=1.2);

The counts of paid courses for each subject are:

1. Business Finance: 1103
2. Graphic Design: 568
3. Musical Instruments: 634
4. Web Development: 1067
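Free and paid counts can also be produced side by side in one table, which makes it easy to check that the two add up to the total per subject. This is a minimal sketch on a hypothetical frame; it assumes the price column is already numeric with 0 meaning free, and the `kind` labels are introduced only for the example.

```python
import pandas as pd

# Hypothetical courses; price 0 means the course is free.
toy = pd.DataFrame({
    "subject": ["Web Development", "Web Development", "Graphic Design",
                "Graphic Design", "Business Finance"],
    "price": [0, 50, 0, 0, 120],
})

# Label each course as free or paid, then cross-tabulate against subject.
kind = toy["price"].gt(0).map({True: "paid", False: "free"}).rename("kind")
counts = pd.crosstab(toy["subject"], kind)

# Row sums give the total number of courses per subject.
totals = counts.sum(axis=1)
print(counts)
```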

### Q5. What are the top-selling courses?

In [None]:
Top_selling=df.groupby(['subject'])['num_subscribers'].sum().sort_values(ascending=False)

In [None]:
colors2=['#2ca02c', '#d62728', '#9467bd','#8c564b']
plt.figure(figsize=(10,5))
Top_selling.plot.bar(color=colors2,zorder=3);
plt.xlabel('Subjects',size=18);
plt.ylabel('number of subscribers',size=18);
plt.title('Number of Subscribers vs. Subjects',size=18);
plt.grid(zorder=0)

#### The top-selling courses are in Web Development

### Q6. Which courses were published in 2015?

In [None]:
df[df['Year']==2015]['course_title'].unique()

### Q7. What is the maximum number of subscribers for a course in each subject?

In [None]:
maximum_subscribers=df.groupby(['subject'])['num_subscribers'].max()
maximum_subscribers

In [None]:
maximum_subscribers.plot.line(figsize=(10,5),marker='*',markersize=15,label='Maximum subscribers');
plt.grid();
plt.legend(fontsize=12);

#### The course with the maximum number of subscribers is in the Web Development subject
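The cell above groups by `subject`. The same pattern applied to a difficulty-level column would look like the sketch below. It assumes a column named `level` exists in the data; that name, the level labels, and the `toy` frame are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical courses with an assumed `level` column.
toy = pd.DataFrame({
    "course_title": ["A", "B", "C", "D"],
    "level": ["Beginner Level", "Beginner Level", "All Levels", "Expert Level"],
    "num_subscribers": [100, 2500, 900, 40],
})

# Maximum subscriber count within each level.
max_per_level = toy.groupby("level")["num_subscribers"].max()

# idxmax returns the row label of the top course per level, so the title can be shown too.
top_rows = toy.loc[
    toy.groupby("level")["num_subscribers"].idxmax(),
    ["level", "course_title", "num_subscribers"],
]
print(top_rows)
```

Using `idxmax` alongside `max` shows which course holds the maximum, not only the value.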

### Q8. Is there a relationship between number of lectures and number of subscribers?

In [None]:
#calculating pearson correlation between variables

df[['num_lectures','num_subscribers']].corr()

In [None]:
# Visualizing the Relation between them

plt.figure(figsize=(10,5));
sns.regplot(x='num_lectures',y='num_subscribers',data=df,color='g');
plt.title('Number of subscribers Vs. Number of lectures',fontsize=18);

#### From the plot and the Pearson correlation coefficient of 0.15, we can conclude that there is almost no linear relationship between the number of subscribers and the number of lectures
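For reference, the Pearson coefficient reported by `.corr()` is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it from that definition on hypothetical values and compares the result with pandas; the `pearson` helper and the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd

def pearson(x, y):
    """Pearson correlation computed directly from its definition."""
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical paired observations.
x = pd.Series([10, 20, 30, 40, 50])
y = pd.Series([120, 80, 150, 90, 160])

r_manual = pearson(x, y)
r_pandas = x.corr(y)  # pandas uses the Pearson method by default
print(r_manual, r_pandas)
```

The coefficient measures only linear association, so a value near zero does not rule out a non-linear relationship.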

### Q9. Is there a relationship between the price of the course and the number of subscribers?

In [None]:
# Pearson correlation between price and number of subscribers
print(df[['price','num_subscribers']].corr())

# jointplot creates its own figure, so set limits and labels through the returned JointGrid
g=sns.jointplot(x='num_subscribers',y='price',data=df,color='g',kind='reg');
g.ax_joint.set_ylim(0,300);
g.set_axis_labels('Number of subscribers','Price',size=18);

#### From the plot and the Pearson correlation coefficient of 0.051, we can conclude that there is almost no linear relationship between the price of a course and its number of subscribers

### Q10. What kind of relationship is there between price and number of lectures?

In [None]:
#calculating pearson correlation between variables

df[['price','num_lectures']].corr()

In [None]:
# Visualizing the Relation between them

plt.figure(figsize=(10,5));
sns.regplot(x='num_lectures',y='price',data=df,color='g');
plt.title('Price Vs. Number of lectures',fontsize=18);
plt.ylabel('Price',size=18);
plt.xlabel('Number of lectures',size=18);

#### The plot and the Pearson correlation of 0.33 show that there is a moderate relationship between price and number of lectures: as the number of lectures increases, the price increases slightly

## Conclusions

### Findings

1- Udemy offers courses in four subjects:

1. Business Finance
2. Graphic Design
3. Musical Instruments
4. Web Development

2- The subject with the maximum number of courses is Web Development

3- The counts of free courses for each subject are:

1. Business Finance: 96
2. Graphic Design: 35
3. Musical Instruments: 46
4. Web Development: 133

4- The counts of paid courses for each subject are:

1. Business Finance: 1103
2. Graphic Design: 568
3. Musical Instruments: 634
4. Web Development: 1067

5- The top-selling courses are in Web Development

6- The course titles published in 2015 include 'Concepts of Statistics For Beginners Step by Step', '10 Numbers Every Business Owner Should Know', '101 Blues riffs - learn how the harmonica superstars do it', ..., 'Your Own Site in 45 Min: The Complete Wordpress Course', 'Your Second Course on Piano: Two Handed Playing', 'Zend Framework 2: Learn the PHP framework ZF2 from scratch.'

7- The course with the maximum number of subscribers is in the Web Development subject


8- From the plot and the Pearson correlation coefficient of 0.15, there is almost no linear relationship between the number of subscribers and the number of lectures

9- From the plot and the Pearson correlation coefficient of 0.051, there is almost no linear relationship between the price of a course and its number of subscribers

10- The plot and the Pearson correlation of 0.33 show that there is a moderate relationship between price and number of lectures: as the number of lectures increases, the price increases slightly