<p style="text-align:center">
    <a href="https://skills.network" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


# **Data Visualization**


Estimated time needed: **30** minutes


In this lab, you will learn how to visualize and interpret data.


## Objectives


* Import Libraries
* Lab Exercises
    * Identifying duplicates
    * Plotting Scatterplots
    * Plotting Boxplots


----


## Import Libraries


All libraries required for this lab are listed below. They are pre-installed on Skills Network Labs, so the install commands can be skipped or commented out there. If you run this notebook in a different environment, e.g. your desktop, you may need to run them to install the libraries.


In [ ]:
# Install the libraries used in this lab (no versions are pinned)
!pip install pandas
!pip install numpy
!pip install scipy
!pip install seaborn
!pip install matplotlib

Import the libraries we need for the lab.


In [ ]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt 

Read the CSV file from the URL using the pandas <code>read_csv()</code> function.


In [ ]:
ratings_url = 'https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ST0151EN-SkillsNetwork/labs/teachingratings.csv'
ratings_df = pd.read_csv(ratings_url)

## Lab Exercises


###  Identify all duplicate cases using prof. Using all observations, find the average and standard deviation for age. Then repeat the analysis after filtering the data set to one observation per instructor, which restricts the total number of observations to 94.


Identify all duplicate cases using the prof variable. Start by finding the unique values of the prof variable.


In [ ]:
ratings_df.prof.unique()

Print out the number of unique values in the prof variable


In [ ]:
ratings_df.prof.nunique()
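
As a complement to <code>unique()</code> and <code>nunique()</code>, the repeated rows themselves can be listed with the pandas <code>duplicated()</code> method. The following is a minimal sketch on a small made-up frame; <code>toy_df</code> and its values are hypothetical and are not taken from the teaching ratings data.

```python
import pandas as pd

# Hypothetical toy data: three instructors, two of whom appear more than once
toy_df = pd.DataFrame({
    "prof": [1, 1, 2, 3, 3, 3],
    "age":  [36, 36, 59, 51, 51, 51],
})

# Number of rows (courses) recorded for each instructor
rows_per_prof = toy_df["prof"].value_counts().sort_index()

# keep=False flags every row whose prof value occurs more than once
duplicate_rows = toy_df[toy_df["prof"].duplicated(keep=False)]

print(rows_per_prof)
print(duplicate_rows)
```

The same two calls should work on <code>ratings_df</code> by replacing <code>toy_df</code>, assuming the column is named <code>prof</code> as in the cells above.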

Using all observations, find the average and standard deviation for age.


In [ ]:
ratings_df['age'].mean()

In [ ]:
ratings_df['age'].std()

Repeat the analysis after filtering the data set to include one observation for each instructor, which restricts the total number of observations to 94.
> First, drop duplicates using prof as the subset and assign the result to a new dataframe called no_duplicates_ratings_df


In [ ]:
no_duplicates_ratings_df = ratings_df.drop_duplicates(subset =['prof'])
no_duplicates_ratings_df.head()

> Use the new dataset to get the mean and standard deviation of age


In [ ]:
no_duplicates_ratings_df['age'].mean()

In [ ]:
no_duplicates_ratings_df['age'].std()
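
The two results can differ because, with all observations, an instructor who teaches several courses is counted once per course. The sketch below illustrates this on hypothetical toy data (<code>toy_df</code> and its values are made up for illustration, not drawn from the lab data set).

```python
import pandas as pd

# Hypothetical toy data: instructor 1 has two rows, instructor 3 has three
toy_df = pd.DataFrame({
    "prof": [1, 1, 2, 3, 3, 3],
    "age":  [30, 30, 60, 45, 45, 45],
})

# Mean over every row: instructors with more courses carry more weight
mean_all_rows = toy_df["age"].mean()

# Keep the first row for each instructor, then average one age per person
one_per_prof = toy_df.drop_duplicates(subset=["prof"])
mean_one_per_prof = one_per_prof["age"].mean()

print(mean_all_rows)      # (30 + 30 + 60 + 45 + 45 + 45) / 6
print(mean_one_per_prof)  # (30 + 60 + 45) / 3
```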

### Using a bar chart, demonstrate if instructors teaching lower-division courses receive higher average teaching evaluations.


In [ ]:
ratings_df.head()

Find the average teaching evaluation in both groups of upper and lower-division


In [ ]:
division_eval = ratings_df.groupby('division')[['eval']].mean().reset_index()
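
To see what this group-by step produces, here is a minimal sketch on hypothetical toy data; the frame <code>toy_df</code> and its scores are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical toy data with the same two column names used in the lab
toy_df = pd.DataFrame({
    "division": ["lower", "lower", "upper", "upper", "upper"],
    "eval":     [4.0, 4.5, 3.5, 4.0, 3.0],
})

# One row per division holding the mean evaluation score;
# reset_index() turns the group label back into an ordinary column
division_eval_toy = toy_df.groupby("division")[["eval"]].mean().reset_index()

print(division_eval_toy)
```

The result is a two-row dataframe, which is the shape that <code>sns.barplot()</code> needs to draw one bar per group.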

Plot the barplot using the seaborn library


In [ ]:
sns.set(style="whitegrid")
ax = sns.barplot(x="division", y="eval", data=division_eval)

### Plot the relationship between age and teaching evaluation scores.


Create a scatterplot with the scatterplot function in the seaborn library


In [ ]:
ax = sns.scatterplot(x='age', y='eval', data=ratings_df)

### Using gender-differentiated scatter plots, plot the relationship between age and teaching evaluation scores.


Create a scatterplot with the scatterplot function in the seaborn library. This time, add the <code>hue</code> argument.


In [ ]:
ax = sns.scatterplot(x='age', y='eval', hue='gender',
                     data=ratings_df)

### Create a box plot for beauty scores differentiated by credits.


We use the <code>boxplot()</code> function from the seaborn library


In [ ]:
ax = sns.boxplot(x='credits', y='beauty', data=ratings_df)

### What is the number of courses taught by gender?


We use the <code>catplot()</code> function from the seaborn library


In [ ]:
sns.catplot(x='gender', kind='count', data=ratings_df)

### Create a group histogram of courses taught by gender and tenure


We will add the <code>hue='tenure'</code> argument


In [ ]:
sns.catplot(x='gender', hue = 'tenure', kind='count', data=ratings_df)
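
The bar heights in a count plot like this one can be checked numerically with a cross-tabulation. The sketch below uses the pandas <code>crosstab()</code> function on hypothetical toy data; <code>toy_df</code> and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: one row per course taught
toy_df = pd.DataFrame({
    "gender": ["female", "female", "male", "male", "male"],
    "tenure": ["yes", "no", "yes", "yes", "no"],
})

# Rows are gender, columns are tenure, cells are course counts
counts = pd.crosstab(toy_df["gender"], toy_df["tenure"])

print(counts)
```

Each cell of the table corresponds to one bar in the grouped count plot, so the two views should agree.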

### Add division as another factor to the above histogram


We add another argument named <code>row</code> and use the division variable as the row


In [ ]:
sns.catplot(x='gender', hue = 'tenure', row = 'division',
            kind='count', data=ratings_df,
            height = 3, aspect = 2)

### Create a scatterplot of age and evaluation scores, differentiated by gender and tenure


Use the <code>relplot()</code> function for complex scatter plots


In [ ]:
sns.relplot(x="age", y="eval", hue="gender",
            row="tenure",
            data=ratings_df, height = 3, aspect = 2)

### Create a distribution plot of teaching evaluation scores


We use the <code>distplot()</code> function from the seaborn library and set <code>kde=False</code> because we don't need the curve.


In [ ]:
ax = sns.distplot(ratings_df['eval'], kde = False)

### Create a distribution plot of teaching evaluation score with gender as a factor


In [ ]:
## use the distplot function from the seaborn library
sns.distplot(ratings_df[ratings_df['gender'] == 'female']['eval'], color='green', kde=False) 
sns.distplot(ratings_df[ratings_df['gender'] == 'male']['eval'], color="orange", kde=False) 
plt.show()

### Create a box plot - age of the instructor by gender


In [ ]:
ax = sns.boxplot(x="gender", y="age", data=ratings_df)

### Compare age along with tenure and gender


In [ ]:
ax = sns.boxplot(x="tenure", y="age", hue="gender",
                 data=ratings_df)

## Practice Questions


### Question 1: Create a distribution plot of beauty scores with Native English speaker as a factor
* Make the plot for native English speakers orange and the plot for non-native English speakers blue


In [ ]:
## insert code


Double-click **here** for the solution.

<!-- The answer is below:
sns.distplot(ratings_df[ratings_df['native'] == 'yes']['beauty'], color="orange", kde=False) 
sns.distplot(ratings_df[ratings_df['native'] == 'no']['beauty'], color="blue", kde=False) 
plt.show()
-->


### Question 2: Create a horizontal box plot of the age of the instructors by visible minority


In [ ]:
## insert code


Double-click **here** for a hint.

<!-- The hint is below:
Remember that which variable is assigned to x and which to y determines whether the box plot is vertical or horizontal
-->


Double-click **here** for the solution.

<!-- The answer is below:
ax = sns.boxplot(x="age", y="minority", data=ratings_df)
-->


### Question 3: Create a group histogram of tenure by minority and add the gender factor


In [ ]:
## insert code

Double-click **here** for the solution.

<!-- The answer is below:
sns.catplot(x='tenure', hue = 'minority', row = 'gender',
            kind='count', data=ratings_df,
            height = 3, aspect = 2)
-->


### Question 4: Create a boxplot of the age variable


In [ ]:
## insert code

Double-click **here** for the solution.

<!-- The answer is below:
## you only have to specify the y-variable
ax = sns.boxplot(y="age", data=ratings_df)
-->


## Authors


[Aije Egwaikhide](https://www.linkedin.com/in/aije-egwaikhide/) is a Data Scientist at IBM who holds a degree in Economics and Statistics from the University of Manitoba and a Post-grad in Business Analytics from St. Lawrence College, Kingston. She is a current employee of IBM where she started as a Junior Data Scientist at the Global Business Services (GBS) in 2018. Her main role was making meaning out of data for their Oil and Gas clients through basic statistics and advanced Machine Learning algorithms. The highlight of her time in GBS was creating a customized end-to-end Machine learning and Statistics solution on optimizing operations in the Oil and Gas wells. She moved to the Cognitive Systems Group as a Senior Data Scientist where she will be providing the team with actionable insights using Data Science techniques and further improve processes through building machine learning solutions. She recently joined the IBM Developer Skills Network group where she brings her real-world experience to the courses she creates.


## Change Log


|  Date (YYYY-MM-DD) |  Version | Changed By  |  Change Description |
|---|---|---|---|
| 2020-08-14  | 0.1  | Aije Egwaikhide  |  Created the initial version of the lab |


 Copyright &copy; 2020 IBM Corporation. This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/).
