**This is an exploratory data analysis (EDA) of the Udemy dataset<br />
Name of dataset: IT & Software Courses Udemy - 22k+ courses<br />
Source of dataset: https://www.kaggle.com/jilkothari/it-software-courses-udemy-22k-courses<br />
<br />
This project focuses on answering the following questions:**


In [12]:
# To install a newer version of the plotly & swifter packages
#!pip3 install plotly==4.8
#!pip3 install swifter

# install pandas profiling package as it is not in colab
# -I force reinstall even if already installed
#!pip3 install -I pandas_profiling
#!pip install numpy==1.20.0

# Error installing pycaret
# pycaret is a low code machine learning package
# shap is a package used by pycaret to explain the outputs of a model
#!pip3 install pycaret --user
#!pip3 install shap --user



In [2]:
# Import the relevant packages

# for basic data cleaning operation
import numpy as np
import pandas as pd

# for visualisation
import matplotlib
import matplotlib.pyplot as plt
plt.style.use('ggplot')
from matplotlib.pyplot import figure
import plotly.express as px

%matplotlib inline
# Adjusts the configuration of the plots
matplotlib.rcParams['figure.figsize'] = (12,8)

# for machine learning
# from pycaret.regression import *
# documentation: https://pycaret.readthedocs.io/en/latest/api/regression.html

# To control visual output
#pd.set_option('display.max_rows',15)

Getting an initial feel for the data
1. Look at the first rows of data
2. Identify how much data is missing in each column
3. Look at the number of unique values in each column
4. Get in-depth information about the data (data types, number of rows, non-null counts)

In [3]:
# Importing data from file
folder_path = "C:/Users/wilso/OneDrive/Desktop/Portfolio_Project/Datasets"
file = 'Udemy_data.csv'

df = pd.read_csv(folder_path + "/" + file)

In [4]:
# Initial look at the data

df.head(10)

Unnamed: 0,id,title,url,is_paid,num_subscribers,avg_rating,avg_rating_recent,rating,num_reviews,is_wishlisted,num_published_lectures,num_published_practice_tests,created,published_time,discount_price__amount,discount_price__currency,discount_price__price_string,price_detail__amount,price_detail__currency,price_detail__price_string
0,762616,The Complete SQL Bootcamp 2020: Go from Zero t...,/course/the-complete-sql-bootcamp/,True,295509,4.66019,4.67874,4.67874,78006,False,84,0,2016-02-14T22:57:48Z,2016-04-06T05:16:11Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
1,937678,Tableau 2020 A-Z: Hands-On Tableau Training fo...,/course/tableau10/,True,209070,4.58956,4.60015,4.60015,54581,False,78,0,2016-08-22T12:10:18Z,2016-08-23T16:59:49Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
2,1361790,PMP Exam Prep Seminar - PMBOK Guide 6,/course/pmp-pmbok6-35-pdus/,True,155282,4.59491,4.59326,4.59326,52653,False,292,2,2017-09-26T16:32:48Z,2017-11-14T23:58:14Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
3,648826,The Complete Financial Analyst Course 2020,/course/the-complete-financial-analyst-course/,True,245860,4.54407,4.53772,4.53772,46447,False,338,0,2015-10-23T13:34:35Z,2016-01-21T01:38:48Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
4,637930,An Entire MBA in 1 Course:Award Winning Busine...,/course/an-entire-mba-in-1-courseaward-winning...,True,374836,4.4708,4.47173,4.47173,41630,False,83,0,2015-10-12T06:39:46Z,2016-01-11T21:39:33Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
5,1208634,Microsoft Power BI - A Complete Introduction [...,/course/powerbi-complete-introduction/,True,124180,4.56228,4.57676,4.57676,38093,False,275,0,2017-05-08T13:03:21Z,2017-05-15T18:48:54Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
6,864146,Agile Crash Course: Agile Project Management; ...,/course/agile-crash-course/,True,96207,4.32383,4.29118,4.29118,30470,False,23,0,2016-05-30T22:57:40Z,2016-06-23T17:49:26Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
7,321410,Beginner to Pro in Excel: Financial Modeling a...,/course/beginner-to-pro-in-excel-financial-mod...,True,127680,4.54034,4.53346,4.53346,28665,False,275,0,2014-10-17T08:39:52Z,2014-11-25T23:00:40Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
8,673654,Become a Product Manager | Learn the Skills & ...,/course/become-a-product-manager-learn-the-ski...,True,112572,4.50386,4.5008,4.5008,27408,False,144,0,2015-11-18T19:35:12Z,2016-03-17T17:04:59Z,455.0,INR,₹455,8640.0,INR,"₹8,640"
9,1653432,The Business Intelligence Analyst Course 2020,/course/the-business-intelligence-analyst-cour...,True,115269,4.50067,4.49575,4.49575,23906,False,413,0,2018-04-19T07:00:09Z,2018-04-25T18:40:55Z,455.0,INR,₹455,8640.0,INR,"₹8,640"


Visualisation:
1. Average number of subscribers 
2. Correlation between the features (Heatmap)
3. Average rating Paid vs free course (Box plot) - discount or not
4. avg_rating vs num_subscribers (Scatterplot)
5. avg_rating vs num_published_practice_tests (Scatterplot)
6. avg_rating vs num_published_lectures (Scatterplot)
7. Dashboard - Histogram for number of test | number of reviews | number of lectures | Price range
8. Wordmap
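
As a rough illustration of two of the planned plots (the correlation heatmap and the avg_rating vs num_subscribers scatterplot), here is a minimal sketch. It assumes a small synthetic frame called `toy` filled with random values; only the column names come from the dataset, so the resulting plots say nothing about the real courses.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Synthetic stand-in data: random values, real column names (assumption for illustration)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "num_subscribers": rng.integers(0, 5000, size=50),
    "avg_rating": rng.uniform(0, 5, size=50),
    "num_reviews": rng.integers(0, 500, size=50),
    "num_published_lectures": rng.integers(1, 100, size=50),
})

# Pairwise Pearson correlation between the numeric columns
corr = toy.corr()

fig, (ax_heat, ax_scatter) = plt.subplots(1, 2, figsize=(12, 5))

# Correlation heatmap drawn with plain matplotlib
im = ax_heat.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax_heat.set_xticks(range(len(corr.columns)))
ax_heat.set_xticklabels(corr.columns, rotation=45, ha="right")
ax_heat.set_yticks(range(len(corr.columns)))
ax_heat.set_yticklabels(corr.columns)
ax_heat.set_title("Correlation between features")
fig.colorbar(im, ax=ax_heat)

# Scatterplot of rating against subscriber count
ax_scatter.scatter(toy["num_subscribers"], toy["avg_rating"], alpha=0.6)
ax_scatter.set_xlabel("num_subscribers")
ax_scatter.set_ylabel("avg_rating")
ax_scatter.set_title("avg_rating vs num_subscribers")

fig.tight_layout()
```

With the real data, `toy` would be replaced by the numeric columns of `df`. Because subscriber counts are heavily skewed, a log-scaled x axis may be worth trying for the scatterplot.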


In [23]:
# To check for missing data
# Loop through the columns and print the percentage of missing values in each

for col in df.columns:
    # np.mean of a boolean mask is a fraction in [0, 1]; the .2% format scales it to a percentage
    pct_missing = np.mean(df[col].isnull())
    print(f'{col} - {pct_missing:.2%}')

id - 0.00%
title - 0.00%
url - 0.00%
is_paid - 0.00%
num_subscribers - 0.00%
avg_rating - 0.00%
avg_rating_recent - 0.00%
rating - 0.00%
num_reviews - 0.00%
is_wishlisted - 0.00%
num_published_lectures - 0.00%
num_published_practice_tests - 0.00%
created - 0.00%
published_time - 0.00%
discount_price__amount - 8.00%
discount_price__currency - 8.00%
discount_price__price_string - 8.00%
price_detail__amount - 2.17%
price_detail__currency - 2.17%
price_detail__price_string - 2.17%
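
The same check can be done without an explicit loop. This is a minimal sketch, assuming a small made-up frame called `demo` in place of the Udemy data. `isnull().mean()` returns the fraction of missing values per column, so it is multiplied by 100 to get a percentage.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df; only the column names are taken from the dataset
demo = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "discount_price__amount": [455.0, np.nan, 455.0, 455.0],
    "price_detail__amount": [8640.0, 1280.0, np.nan, np.nan],
})

# Fraction of nulls per column, converted to percent and sorted from most to least missing
pct_missing = demo.isnull().mean().mul(100).sort_values(ascending=False)
print(pct_missing.round(2))
```

Sorting puts the columns that need the most attention at the top.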


In [34]:
# To get more information about the data

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 22853 entries, 0 to 22852
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   id                            22853 non-null  int64  
 1   title                         22853 non-null  object 
 2   url                           22853 non-null  object 
 3   is_paid                       22853 non-null  bool   
 4   num_subscribers               22853 non-null  int64  
 5   avg_rating                    22853 non-null  float64
 6   avg_rating_recent             22853 non-null  float64
 7   rating                        22853 non-null  float64
 8   num_reviews                   22853 non-null  int64  
 9   is_wishlisted                 22853 non-null  bool   
 10  num_published_lectures        22853 non-null  int64  
 11  num_published_practice_tests  22853 non-null  int64  
 12  created                       22853 non-null  object 
 13  p

After an initial look at the data, a few cleaning steps are needed:

1. Convert created and published_time to the datetime format.
2. Expand each datetime column into day, month and year columns.
3. Fill missing values in the discounted price and price detail columns with zero.
4. Add columns for the discounted price and price detail in SGD for easier reference.
5. Drop the currency and price string columns, as each currency column holds a single value (INR) and the price strings duplicate the amounts.
6. Drop the is_wishlisted column, as every value is False and it carries no information.
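
The cleaning steps listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the notebook's actual implementation. The helper `clean`, the two-row `sample` frame and the `_day`/`_month`/`_year`/`_sgd` suffixes are all hypothetical. The default factor of 0.014 is the one used in the notebook's last cell and should be treated as a placeholder to check against a current exchange rate.

```python
import pandas as pd

# Hypothetical two-row sample that reuses the dataset's column names
sample = pd.DataFrame({
    "created": ["2016-02-14T22:57:48Z", "2017-09-26T16:32:48Z"],
    "published_time": ["2016-04-06T05:16:11Z", "2017-11-14T23:58:14Z"],
    "discount_price__amount": [455.0, None],
    "discount_price__currency": ["INR", None],
    "discount_price__price_string": ["₹455", None],
    "price_detail__amount": [8640.0, None],
    "price_detail__currency": ["INR", None],
    "price_detail__price_string": ["₹8,640", None],
    "is_wishlisted": [False, False],
})


def clean(frame: pd.DataFrame, rate: float = 0.014) -> pd.DataFrame:
    """Apply the planned cleaning steps; `rate` is a placeholder conversion factor."""
    out = frame.copy()

    # Parse the timestamps, then expand them into day, month and year columns
    for col in ["created", "published_time"]:
        out[col] = pd.to_datetime(out[col], utc=True)
        out[f"{col}_day"] = out[col].dt.day
        out[f"{col}_month"] = out[col].dt.month
        out[f"{col}_year"] = out[col].dt.year

    # Replace missing prices with zero
    price_cols = ["discount_price__amount", "price_detail__amount"]
    out[price_cols] = out[price_cols].fillna(0)

    # Add converted price columns, rounded to two decimals
    for col in price_cols:
        out[f"{col}_sgd"] = (out[col] * rate).round(2)

    # Drop the currency, price string and is_wishlisted columns
    drop_cols = [c for c in out.columns if c.endswith(("__currency", "__price_string"))]
    return out.drop(columns=drop_cols + ["is_wishlisted"])


cleaned = clean(sample)
print(cleaned.columns.tolist())
```

Working on a copy keeps the raw frame available for comparison, which helps when checking that the zero-fill did not distort the price statistics.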


In [26]:
df.columns

Index(['id', 'title', 'url', 'is_paid', 'num_subscribers', 'avg_rating',
       'avg_rating_recent', 'rating', 'num_reviews', 'is_wishlisted',
       'num_published_lectures', 'num_published_practice_tests', 'created',
       'published_time', 'discount_price__amount', 'discount_price__currency',
       'discount_price__price_string', 'price_detail__amount',
       'price_detail__currency', 'price_detail__price_string'],
      dtype='object')

In [41]:
df.is_wishlisted.value_counts()

False    22853
Name: is_wishlisted, dtype: int64

In [40]:
df.num_published_practice_tests.value_counts()

0    20111
2      949
3      424
6      374
1      366
4      350
5      279
Name: num_published_practice_tests, dtype: int64

In [42]:
df.describe()

Unnamed: 0,id,num_subscribers,avg_rating,avg_rating_recent,rating,num_reviews,num_published_lectures,num_published_practice_tests,discount_price__amount,price_detail__amount
count,22853.0,22853.0,22853.0,22853.0,22853.0,22853.0,22853.0,22853.0,21024.0,22356.0
mean,1818466.0,3205.448256,3.952356,3.937739,3.937739,270.277557,34.91721,0.375224,486.266077,4445.517982
std,927352.5,11051.296472,0.875152,0.888605,0.888605,2048.788093,48.65282,1.160939,234.100393,3098.531678
min,2762.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,455.0,1280.0
25%,1090694.0,76.0,3.75,3.73246,3.73246,8.0,11.0,0.0,455.0,1280.0
50%,1824268.0,559.0,4.15,4.14868,4.14868,27.0,22.0,0.0,455.0,3200.0
75%,2604580.0,2483.0,4.43548,4.43352,4.43352,98.0,41.0,0.0,455.0,8640.0
max,3486006.0,564444.0,5.0,5.0,5.0,188941.0,699.0,6.0,3200.0,12800.0


In [43]:
df.nunique()

id                              22853
title                           22750
url                             22853
is_paid                             2
num_subscribers                  6824
avg_rating                       3235
avg_rating_recent               20070
rating                          20070
num_reviews                      1750
is_wishlisted                       1
num_published_lectures            392
num_published_practice_tests        7
created                         22851
published_time                  22846
discount_price__amount             55
discount_price__currency            1
discount_price__price_string       55
price_detail__amount               37
price_detail__currency              1
price_detail__price_string         37
dtype: int64
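
Columns with a single unique value, such as is_wishlisted and the two currency columns in the output above, can also be listed programmatically rather than read off by eye. This is a minimal sketch on a made-up frame called `toy`; the values are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in frame; column names borrowed from the dataset
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "is_wishlisted": [False, False, False],
    "price_detail__currency": ["INR", "INR", None],
    "num_reviews": [10, 0, 3],
})

# nunique() ignores nulls by default, so a column holding one value plus nulls still counts as constant
constant_cols = toy.columns[toy.nunique() == 1].tolist()
print(constant_cols)
```

Columns flagged this way are candidates for dropping, since they cannot help distinguish one course from another.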

In [None]:
data = data.rename(columns =

In [None]:
# Scale the amount by a fixed factor of 0.014 and round to 2 decimal places
data['pd_amount'] = (data['pd_amount'] * 0.014).round(2)