Hey there. With the data cleaned up and ready, below is the list of questions I'm looking to tackle. The main goal of this project is to find patterns or trends between any relevant attribute and the number of comments and ratings. I also want to identify demographics that use UWFlow less than others, so they can be targeted for better engagement.

# Main questions

1. Are the demographic patterns the same across departments? Do they influence each other? E.g., if a department has a high percentage of liked professors, do students tend to like its courses as well?

2. Rank departments from lowest to highest by the ratio of the number of ratings to the number of students enrolled over the last few years, i.e. Number_of_Ratings/Course_Enrollment.

3. Which department scores best on each metric? E.g., which has the most liked courses on average, and so on.

4. How different is the sentiment of the reviews between professors and the courses they teach?

5. Which department/course level (100, 200, 300, etc.) has the most courses with fewer than x ratings?
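To make the ratio in question 2 concrete, here is a minimal sketch on made-up rows. The column names match `course_df`, but the numbers are hypothetical, and summing within each department before dividing is just one reasonable way to aggregate (averaging per-course ratios would weight small courses more heavily).

```python
import pandas as pd

# Hypothetical toy rows; column names follow course_df, values are invented.
toy_courses = pd.DataFrame({
    "Department": ["Math", "Math", "Arts", "Arts"],
    "Number_of_Ratings": [100, 50, 5, 15],
    "Course_Enrollment": [1000, 500, 400, 600],
})

# Sum within each department first, then divide (an assumption about aggregation).
dept_totals = toy_courses.groupby("Department")[
    ["Number_of_Ratings", "Course_Enrollment"]
].sum()
dept_totals["Ratings_per_Enrolled"] = (
    dept_totals["Number_of_Ratings"] / dept_totals["Course_Enrollment"]
)

# Lowest ratio first: the least engaged departments relative to their size.
ranking = dept_totals["Ratings_per_Enrolled"].sort_values()
print(ranking)
```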

I'll add more questions later on if I find anything. All the major takeaways will be published to a Power BI report.

In [3]:
# Import all dependencies

import numpy as np
import re
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy.stats import boxcox, yeojohnson, skew
import warnings
warnings.filterwarnings('ignore')

In [58]:
# Read in the datasets
course_df = pd.read_csv("./data files/Cleaned/cleaned_course_data2.csv")
# Drop the columns at positions 12 through 18, then the unnamed index column
course_df = course_df.drop(course_df.columns[12:19], axis=1)
course_df = course_df.drop(course_df.columns[0], axis=1)

prof_df = pd.read_csv("./data files/Cleaned/cleaned_prof_data2.csv")
prof_df = prof_df.drop(prof_df.columns[0], axis=1)  # drop the unnamed index column
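Dropping columns by position works, but it silently breaks if the CSV layout ever changes. As a sketch of a more defensive alternative, the unnamed index column can be removed by name instead. The CSV text below is a hypothetical stand-in for the real files, not their actual contents.

```python
import io
import pandas as pd

# Hypothetical stand-in for a CSV saved with its index (note the empty first header).
csv_text = ",Course_Code,Number_of_Ratings\n0,CS 115,2206\n1,MATH 135,1555\n"

raw = pd.read_csv(io.StringIO(csv_text))

# pandas names the headerless index column "Unnamed: 0"; filter it out by name.
cleaned = raw.loc[:, ~raw.columns.str.startswith("Unnamed")]
print(list(cleaned.columns))
```

Writing the cleaned files with `to_csv(..., index=False)` in the first place would avoid the extra column altogether.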

In [40]:
course_df.head(5)

Unnamed: 0,Course_Code,Course_Name,Number_of_Ratings,Number_of_Comments,Useful,Easy,Liked,Course_Reviews,Course_Enrollment,Department,Course_Level,Useful_percentage,Easy_percentage,Liked_percentage,Positive_Score,Neutral_Score,Negative_Score
0,CS 115,Introduction to Computer Science 1,2206,114,485,243,552,"['A bird course, easy to get 90+, but it is us...",4359,Math,100,22,11,25,0.519009,0.241392,0.239599
1,MATH 135,Algebra for Honours Mathematics,1555,338,1306,669,1213,"['Very easy and interesting course, no concept...",7597,Math,100,84,43,78,0.586315,0.256309,0.157375
2,ECON 101,Introduction to Microeconomics,1398,264,881,979,629,['you can just google everything but its just ...,6247,Math,100,63,70,45,0.380631,0.292072,0.327296
3,MATH 137,Calculus 1 for Honours Mathematics,1036,211,870,580,704,"['Easy course', 'The course itself is somewhat...",8237,Math,100,84,56,68,0.349131,0.356191,0.294678
4,PD 1,Career Fundamentals,1000,189,190,800,70,['The only effect of this course is to add pre...,5790,Coop,<100,19,80,7,0.170334,0.268717,0.560948


In [41]:
prof_df.head(5)

Unnamed: 0,Professor,Courses_Taught,Professor_Reviews,Liked_%,Clear,Engaging,Number_of_Comments,Number_of_Ratings,Department,Positive_Score,Neutral_Score,Negative_Score
0,Aakar Gupta,['CS 230'],"['TA was more clear and engaging', ""Doesn't re...",0.0,50.0,0.0,2,2,Math,0.356152,0.175943,0.467906
1,Aaron Hutchinson,['MATH 115'],"[""There aren't any lectures this term, so I ca...",100.0,100.0,100.0,1,2,Math,0.718039,0.260987,0.020974
2,Aaron Kay,"['PSYCH 253', 'PSYCH 395']",['By far the best prof ive ever had. He is a g...,0.0,0.0,0.0,9,0,Arts,0.884515,0.068476,0.047009
3,Aaron Smith,"['MATH 115', 'MATH 211', 'PMATH 467']","[""I believe our class was the first class he t...",66.666667,66.666667,100.0,4,3,Math,0.528319,0.20109,0.270591
4,Aazar Zafar,['AFM 273'],['Explained some concepts quite well using exa...,0.0,0.0,0.0,5,0,Math,0.681675,0.124219,0.194106


## Question 1

Are the demographic patterns the same across departments? Do they influence each other? E.g., if a department has a high percentage of liked professors, do students tend to like its courses as well?
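One way to approach the professor/course part of this question is to average the liked percentages per department in both datasets and line them up. The sketch below uses hypothetical toy frames (the column names follow `course_df` and `prof_df`, the values are invented), so it only shows the mechanics.

```python
import pandas as pd

# Hypothetical toy frames; values are made up for illustration.
toy_courses = pd.DataFrame({
    "Department": ["Math", "Math", "Arts", "Arts", "Science"],
    "Liked_percentage": [80, 60, 40, 50, 70],
})
toy_profs = pd.DataFrame({
    "Department": ["Math", "Arts", "Arts", "Science"],
    "Liked_%": [90.0, 30.0, 50.0, 75.0],
})

course_liked = (
    toy_courses.groupby("Department")["Liked_percentage"].mean().rename("Course_Liked")
)
prof_liked = toy_profs.groupby("Department")["Liked_%"].mean().rename("Prof_Liked")

# Align on Department; the inner join keeps departments present in both frames.
dept_liked = pd.concat([course_liked, prof_liked], axis=1, join="inner")

# Pearson correlation across departments. With only a handful of departments,
# treat this as descriptive rather than as a significance test.
corr = dept_liked["Course_Liked"].corr(dept_liked["Prof_Liked"])
print(dept_liked)
print(round(corr, 3))
```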

In [48]:
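# numeric_only=True skips text columns, which recent pandas versions no longer drop silently
course_df.groupby("Department").mean(numeric_only=True).round(2).sort_values(by="Number_of_Ratings", ascending=False)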

Department,Number_of_Ratings,Number_of_Comments,Useful,Easy,Liked,Course_Enrollment,Useful_percentage,Easy_percentage,Liked_percentage,Positive_Score,Neutral_Score,Negative_Score
Coop,41.89,6.71,15.75,29.06,14.83,248.33,21.25,34.36,26.03,0.22,0.53,0.25
Math,34.6,7.21,24.95,18.11,22.17,237.55,33.0,22.85,33.21,0.25,0.53,0.22
Engineering,14.32,2.6,9.58,7.72,8.75,133.7,30.71,24.6,32.65,0.21,0.55,0.23
Science,13.66,2.94,9.95,8.36,9.12,100.04,31.85,26.54,33.82,0.23,0.55,0.22
Environment,7.95,1.76,5.49,5.25,5.22,69.7,29.45,26.46,32.89,0.23,0.56,0.21
Arts,4.58,1.07,2.73,3.31,3.16,32.67,20.81,21.3,27.47,0.2,0.59,0.21
Health,4.07,0.94,3.02,2.67,2.89,53.89,20.7,19.06,22.56,0.2,0.6,0.21


Let's focus on undergraduate classes.

In [61]:
# Keep undergraduate course levels only; numeric_only=True skips text columns
course_df[course_df["Course_Level"].isin(["<100", "100", "200", "300", "400"])].groupby("Department").mean(numeric_only=True).round(2).sort_values(by="Number_of_Ratings", ascending=False)

Department,Number_of_Ratings,Number_of_Comments,Useful,Easy,Liked,Course_Enrollment,Useful_percentage,Easy_percentage,Liked_percentage,Positive_Score,Neutral_Score,Negative_Score
Math,57.35,11.97,41.36,30.03,36.73,389.12,49.93,36.07,48.82,0.34,0.44,0.22
Coop,42.89,6.87,16.13,29.75,15.18,254.24,21.75,35.17,26.65,0.22,0.52,0.25
Engineering,22.85,4.07,15.24,12.33,13.89,207.78,42.72,34.64,44.15,0.27,0.49,0.24
Science,19.59,4.21,14.26,11.98,13.07,142.55,44.48,36.99,47.19,0.29,0.48,0.22
Environment,11.12,2.46,7.68,7.35,7.31,97.53,41.21,37.03,46.03,0.29,0.5,0.21
Health,6.74,1.57,5.01,4.42,4.8,89.43,33.91,31.41,36.99,0.27,0.53,0.2
Arts,5.13,1.2,3.06,3.71,3.53,36.5,23.21,23.76,30.63,0.21,0.58,0.21


Let's see how the data is affected by the number of courses with no ratings.

In [60]:
course_df[(course_df["Number_of_Ratings"] == 0) & (course_df["Course_Level"].isin(["<100", "100", "200", "300", "400"]))].groupby("Department").count()["Course_Code"].sort_values(ascending=False)

Department
Arts           1437
Engineering     277
Science         232
Health          224
Math            211
Environment     105
Coop             52
Name: Course_Code, dtype: int64

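From here we can see that Arts is heavily affected: it has 1,437 undergraduate courses with no ratings, roughly five times as many as the next department, Engineering (277).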
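Raw counts depend on how many courses each department offers, so the share of unrated courses may be a fairer comparison. A minimal sketch on hypothetical toy rows (the column names follow `course_df`, the values are invented):

```python
import pandas as pd

# Hypothetical toy rows; values are made up for illustration.
toy_courses = pd.DataFrame({
    "Department": ["Arts", "Arts", "Arts", "Arts", "Math", "Math"],
    "Number_of_Ratings": [0, 0, 0, 12, 0, 40],
})

# The mean of a boolean column is the fraction of True values,
# i.e. the share of courses with zero ratings in each department.
zero_share = (
    toy_courses["Number_of_Ratings"].eq(0)
    .groupby(toy_courses["Department"])
    .mean()
    .sort_values(ascending=False)
)
print(zero_share)
```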

Let's keep only courses with more than the average number of ratings for more reliable results. Note that the mean is taken over all courses, not just the undergraduate ones.

In [66]:
mean_ratings = course_df["Number_of_Ratings"].mean()
undergrad_data = course_df[course_df["Course_Level"].isin(["<100", "100", "200", "300", "400"])].copy()

# numeric_only=True skips text columns such as Course_Code and Course_Reviews
undergrad_data[undergrad_data["Number_of_Ratings"] > mean_ratings].groupby("Department").mean(numeric_only=True).round(2).sort_values(by="Number_of_Ratings", ascending=False)

Department,Number_of_Ratings,Number_of_Comments,Useful,Easy,Liked,Course_Enrollment,Useful_percentage,Easy_percentage,Liked_percentage,Positive_Score,Neutral_Score,Negative_Score
Math,152.62,31.58,110.16,80.08,97.5,993.57,72.32,52.56,66.24,0.5,0.25,0.25
Coop,141.97,23.14,52.97,98.97,49.51,859.54,33.14,65.32,32.68,0.34,0.29,0.37
Science,82.68,18.11,60.51,50.68,54.73,539.39,74.25,60.82,69.21,0.5,0.25,0.25
Engineering,68.34,12.33,45.95,36.9,41.17,518.49,70.09,55.63,62.6,0.45,0.28,0.27
Arts,54.77,13.7,31.9,40.5,36.22,338.32,62.97,71.78,69.78,0.53,0.23,0.24
Environment,54.23,12.62,37.08,36.52,34.79,423.5,65.96,66.71,63.17,0.54,0.21,0.25
Health,37.31,8.49,28.05,24.12,26.31,374.09,75.05,64.09,69.43,0.6,0.21,0.19


The sentiment scores change immediately. Across all courses, the neutral score was the dominant one in every department. After filtering, the positive score is dominant everywhere except Coop, where the negative score is the highest (0.37). This shows how much the picture can change, and suggests the scores become more representative when more people participate in reviewing.
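The dominant score per department can also be read off programmatically instead of by eye. The sketch below retypes the three sentiment columns from the filtered table above and uses `idxmax` to pick the largest one in each row.

```python
import pandas as pd

# Sentiment means retyped from the filtered table above
# (undergraduate courses with more than the mean number of ratings).
sentiment = pd.DataFrame(
    {
        "Positive_Score": [0.50, 0.34, 0.50, 0.45, 0.53, 0.54, 0.60],
        "Neutral_Score": [0.25, 0.29, 0.25, 0.28, 0.23, 0.21, 0.21],
        "Negative_Score": [0.25, 0.37, 0.25, 0.27, 0.24, 0.25, 0.19],
    },
    index=["Math", "Coop", "Science", "Engineering", "Arts", "Environment", "Health"],
)

# idxmax(axis=1) returns, for each department, the column holding the largest score.
dominant = sentiment.idxmax(axis=1)
print(dominant)
```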

#### For further clarity, let's filter down to courses with more than 50, 70, and 100 ratings

In [68]:
# Courses with more than 50 ratings; numeric_only=True skips text columns
undergrad_data[undergrad_data["Number_of_Ratings"] > 50].groupby("Department").mean(numeric_only=True).round(2).sort_values(by="Number_of_Ratings", ascending=False)

Department,Number_of_Ratings,Number_of_Comments,Useful,Easy,Liked,Course_Enrollment,Useful_percentage,Easy_percentage,Liked_percentage,Positive_Score,Neutral_Score,Negative_Score
Math,275.47,55.48,199.07,144.99,175.31,1616.31,73.13,54.73,65.4,0.49,0.25,0.26
Science,191.46,41.15,138.83,117.65,124.04,1106.0,71.62,61.02,65.77,0.52,0.23,0.24
Coop,188.65,31.08,71.35,133.31,66.38,1102.77,34.65,69.77,34.46,0.31,0.27,0.42
Environment,146.73,35.0,102.73,97.18,93.45,1017.27,71.91,62.55,63.45,0.58,0.2,0.22
Engineering,125.18,21.79,82.44,66.71,74.02,792.38,68.78,54.62,61.13,0.44,0.27,0.3
Arts,115.33,28.35,64.42,87.42,73.46,661.53,60.46,75.23,66.21,0.51,0.24,0.26
Health,88.75,19.83,66.17,58.92,66.17,676.75,72.42,68.17,72.92,0.6,0.19,0.2


In [69]:
# Courses with more than 70 ratings; numeric_only=True skips text columns
undergrad_data[undergrad_data["Number_of_Ratings"] > 70].groupby("Department").mean(numeric_only=True).round(2).sort_values(by="Number_of_Ratings", ascending=False)

Department,Number_of_Ratings,Number_of_Comments,Useful,Easy,Liked,Course_Enrollment,Useful_percentage,Easy_percentage,Liked_percentage,Positive_Score,Neutral_Score,Negative_Score
Math,322.26,64.23,232.35,169.97,204.05,1844.06,72.48,55.7,64.13,0.49,0.25,0.26
Science,217.58,46.6,157.57,132.57,139.82,1226.28,71.08,59.1,63.92,0.51,0.24,0.25
Coop,211.95,34.64,82.05,149.73,74.77,1058.64,37.32,69.77,34.82,0.3,0.28,0.42
Environment,167.22,39.33,116.33,112.11,105.44,1040.56,71.11,64.11,61.56,0.59,0.19,0.22
Engineering,151.99,26.68,98.58,80.92,88.08,875.84,67.58,54.96,59.01,0.42,0.27,0.32
Arts,143.13,34.34,75.92,107.97,88.26,798.18,55.89,74.16,62.68,0.48,0.24,0.28
Health,98.44,22.78,73.89,63.11,74.33,760.89,72.78,64.78,74.0,0.61,0.2,0.2


In [70]:
# Courses with more than 100 ratings; numeric_only=True skips text columns
undergrad_data[undergrad_data["Number_of_Ratings"] > 100].groupby("Department").mean(numeric_only=True).round(2).sort_values(by="Number_of_Ratings", ascending=False)

Department,Number_of_Ratings,Number_of_Comments,Useful,Easy,Liked,Course_Enrollment,Useful_percentage,Easy_percentage,Liked_percentage,Positive_Score,Neutral_Score,Negative_Score
Math,404.68,79.1,293.08,211.84,257.23,2252.58,74.37,54.72,65.71,0.49,0.26,0.25
Coop,260.25,43.12,102.19,183.0,92.12,1319.0,37.94,68.19,34.56,0.32,0.28,0.41
Science,255.61,54.77,184.29,158.42,164.1,1476.87,69.9,61.65,63.71,0.49,0.24,0.27
Engineering,214.86,36.86,138.36,111.0,124.29,1072.86,68.38,52.07,59.31,0.42,0.26,0.32
Environment,206.83,48.67,141.67,138.5,131.5,1292.0,69.67,62.67,62.17,0.62,0.2,0.18
Arts,179.62,42.88,94.79,135.42,109.67,1028.46,56.83,73.29,62.04,0.47,0.24,0.29
Health,117.25,25.0,102.75,57.25,95.0,898.25,88.25,44.25,80.25,0.7,0.18,0.11
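The three threshold cells repeat the same pattern, so they could be folded into one helper. The sketch below is a hypothetical `summarize_by_threshold` function run on invented toy rows. It also records how many courses sit behind each mean, since a department average at a high threshold may rest on only a few courses.

```python
import pandas as pd

def summarize_by_threshold(df, thresholds, value_cols):
    """Return one department-level summary per minimum-ratings threshold."""
    frames = []
    for t in thresholds:
        subset = df[df["Number_of_Ratings"] > t]
        summary = subset.groupby("Department")[value_cols].mean().round(2)
        # Number of courses behind each mean, to gauge how reliable it is.
        summary["Course_Count"] = subset.groupby("Department").size()
        summary["Min_Ratings"] = t
        frames.append(summary.reset_index())
    return pd.concat(frames, ignore_index=True)

# Hypothetical toy rows; column names follow course_df, values are invented.
toy = pd.DataFrame({
    "Department": ["Math", "Math", "Arts", "Arts", "Health"],
    "Number_of_Ratings": [400, 80, 120, 60, 55],
    "Liked_percentage": [66, 70, 62, 68, 80],
})

result = summarize_by_threshold(
    toy, [50, 70, 100], ["Number_of_Ratings", "Liked_percentage"]
)
print(result)
```

On the real data the call would presumably take `undergrad_data` and the full list of metric columns instead of the toy frame.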
