
### Instructions

#### Goal of the Project

This project is designed for you to practice and solve the activities that are based on the concepts covered in the lessons:

  * Streamlit Framework I
  * Streamlit Framework II
  * Streamlit Widgets I
  * Streamlit Widgets II

---

#### Getting Started:

1. Follow the next three steps to create a copy of this Colab file and start working on the project.

2. Create a duplicate copy of the Colab file as described below.

  - Click on the **File menu**. A new drop-down list will appear.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-0/0_file_menu.png' width=500>

  - Click on the **Save a copy in Drive** option. A duplicate copy will get created. It will open up in the new tab on your web browser.

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/lesson-0/1_create_colab_duplicate_copy.png' width=500>

3. After creating the duplicate copy of the notebook, please rename it in the **YYYY-MM-DD_StudentName_Project95** format.

4. Now, write your code in the prescribed code cells.


---

#### Problem Statement

In this project, you are going to create a Census Data Visualization web app using the Streamlit framework to display charts for the 5 conditions given in **Activity 2**.



---

### Dataset Description

The dataset includes 32561 instances with 14 features and 1 target column, summarized below:

|Field|Description|
|---:|:---|
|age|age of the person, Integer.|
|work-class| employment information about the individual, Categorical.|
|fnlwgt| unknown weights, Integer.|
|education| highest level of education obtained, Categorical.|
|education-years|number of years of education, Integer.|
|marital-status| marital status of the person, Categorical.|
|occupation|job title, Categorical.|
|relationship| individual's relation in the family, such as wife, husband, and so on, Categorical.|
|race|Categorical.|
|sex| gender, Male, or Female.|
|capital-gain| gain from sources other than salary/wages, Integer.|
|capital-loss| loss from sources other than salary/wages, Integer.|
|hours-per-week| hours worked per week, Integer.|
|native-country| name of the native country, Categorical.|
|income-group| annual income, Categorical,  **<=50k** or **>50k**.|


**Notes:**
1. The dataset has no header row, so the column names must be added manually.
2. Invalid values in the dataset are marked as **"?"**.
3. As no information about **fnlwgt** is available, it can be removed before model training.
4. Take note of the **whitespaces (" ")**  throughout the dataset.
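
The notes above can all be handled while loading the file. The following is a minimal sketch, assuming three made-up rows and a reduced, hypothetical four-column layout in place of the full 15-column file. `header`, `names`, `skipinitialspace` and `na_values` are standard `pandas.read_csv` parameters.

```python
import io
import pandas as pd

# Three made-up rows in the same style as the raw file: no header row,
# a space after every comma, and '?' marking a missing value.
raw = io.StringIO(
    "39, State-gov, 40, <=50K\n"
    "50, ?, 13, <=50K\n"
    "38, Private, 45, >50K\n"
)

# Hypothetical reduced column list (the real file has 15 columns).
columns = ['age', 'workclass', 'hours-per-week', 'income']

# header=None      -> the first row is data, not column names
# names=columns    -> add the column names manually
# skipinitialspace -> drop the space that follows each comma
# na_values='?'    -> treat '?' as a missing value (NaN)
df = pd.read_csv(raw, header=None, names=columns,
                 skipinitialspace=True, na_values='?')

# Remove the rows that contain missing values.
clean_df = df.dropna().reset_index(drop=True)
print(clean_df)
```

With the whitespace stripped at load time, category values can be compared as `'Private'` rather than `' Private'`.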



**Dataset Credits:** https://archive.ics.uci.edu/ml/datasets/adult

**Dataset Creator:**
```
Dua, D., & Graff, C.. (2017). UCI Machine Learning Repository.
```

---

### List of Activities

**Activity 1:** Filter Streamlit Warnings
  
**Activity 2:** Design the Visualisation Web App

---

#### Creating Python File for the Visualisation Web App


In this activity, you have to create a Python file `census_app.py` in the Sublime editor and save it in the `Python_scripts` folder.

Copy the code given below into the `census_app.py` file. You are already familiar with this code: it defines a function that loads the data from the CSV file.

**Dataset Link:** https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/adult.csv

**Note:** Do not run the code shown below. It will throw an error.


In [None]:
# Open Sublime text editor, create a new Python file, copy the following code in it and save it as 'census_app.py'.

# Import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import streamlit as st

@st.cache()
def load_data():
	# Load the Adult Income dataset into DataFrame.

	df = pd.read_csv('https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/whitehat-ds-datasets/adult.csv', header=None)
	df.head()

	# Rename the column names in the DataFrame using the list given below.

	# Create the list
	column_name =['age', 'workclass', 'fnlwgt', 'education', 'education-years', 'marital-status', 'occupation', 'relationship', 'race','gender','capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

	# Rename the columns using 'rename()'
	for i in range(df.shape[1]):
	  df.rename(columns={i:column_name[i]},inplace=True)

	# Print the first five rows of the DataFrame
	df.head()

	# Replace the invalid values ' ?' with 'np.nan'.

	df['native-country'] = df['native-country'].replace(' ?',np.nan)
	df['workclass'] = df['workclass'].replace(' ?',np.nan)
	df['occupation'] = df['occupation'].replace(' ?',np.nan)

	# Delete the rows with invalid values and the column not required

	# Delete the rows with the 'dropna()' function
	df.dropna(inplace=True)

	# Delete the column with the 'drop()' function
	df.drop(columns='fnlwgt',axis=1,inplace=True)

	return df

census_df = load_data()

---

#### Activity 1: Filter Streamlit Warnings

Filter the warnings for streamlit using `st.set_option('deprecation.showPyplotGlobalUse', False)`.

In [None]:
# Write your code to filter streamlit warnings
st.set_option('deprecation.showPyplotGlobalUse', False)

**After this step, the Python file should be created on the local system with the function to load the data from the csv data file.**
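
The option above hides the warning that older Streamlit versions show when `st.pyplot()` is called without arguments. A possible alternative, sketched here on the assumption that `matplotlib` is installed, is to draw each chart on its own `Figure` object and pass that object to `st.pyplot(fig)`. The helper name `make_pie_figure` and the sample counts are hypothetical.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt

def make_pie_figure(counts, title):
    """Return a Figure holding a pie chart of a {label: count} mapping."""
    fig, ax = plt.subplots(figsize=(10, 5))
    ax.pie(list(counts.values()), labels=list(counts.keys()))
    ax.set_title(title)
    return fig

# Made-up counts, used only to exercise the helper.
fig = make_pie_figure({'<=50K': 3, '>50K': 1},
                      'Pie chart for values in Income-Group')

# Inside census_app.py the figure would then be passed explicitly:
# st.pyplot(fig)
```

Because every chart lives on its own figure, one plot cannot accidentally draw on top of another.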

---

#### Activity 2: Design the Visualisation Web App

In this activity, you have to design the user interface of the web app that displays the graphs for the following conditions:

1. Create a chart/plot to display the distribution of records for the `income-group` feature.

2. Create a chart/plot to display the distribution of records for the `gender` feature.

3. Create a chart/plot to display the difference in the range of values for the `hours-per-week` feature for different income groups.

4. Create a chart/plot to display the difference in the range of values for the `hours-per-week` feature for different gender groups.

5. Create a chart/plot to display the number of records for each unique `workclass` feature value for different income groups.


**Steps:**

1. Add the title to the app.

2. Create a sidebar to display the **Menu**.

3. Add an option to display the raw data in the Menu.

4. Add a multiselect to display the selection of the chart/plot options available.

  **Note:** The chart/plot options displayed should be corresponding to the 5 conditions asked above.
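
One way to keep the menu labels and the plotting code in step is to store each label once, together with the chart type and the columns it needs, and then loop over the user's selection. The sketch below is plain Python with no Streamlit calls. `CHARTS` and `charts_for` are hypothetical names, and the list passed in stands in for the value a multiselect widget would return.

```python
# Each menu label is stored once, alongside the chart kind and the columns it uses.
CHARTS = {
    'Distribution of Records for Income-group': ('pie', ['income']),
    'Distribution of Records for Gender': ('pie', ['gender']),
    'Difference in Values for Weekly Hours for Income Group': ('box', ['hours-per-week', 'income']),
    'Difference in Values for Weekly Hours for Gender': ('box', ['hours-per-week', 'gender']),
    'Number of Records for Workclass of Different Income Groups': ('count', ['workclass', 'income']),
}

def charts_for(selected):
    """Return (label, kind, columns) for every selected label, in menu order."""
    return [(label, kind, cols)
            for label, (kind, cols) in CHARTS.items()
            if label in selected]

# Stand-in for the list returned by the multiselect widget.
plot_list = ['Difference in Values for Weekly Hours for Gender',
             'Distribution of Records for Income-group']

for label, kind, cols in charts_for(plot_list):
    print(f'{kind:5} plot of {cols} for "{label}"')
```

The multiselect options can then be taken from `list(CHARTS)`, so each label is typed only once.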


In [None]:
# Write the code to design the web app

# Add title on the main page and in the sidebar.
st.title("Census Visualization App")
st.sidebar.title("Menu")

# Using the 'if' statement, display raw data on the click of the checkbox.
if st.sidebar.checkbox("Show raw data"):
    st.subheader("Census Dataset")
    st.dataframe(census_df)

# Add a multiselect widget to allow the user to select multiple visualisations.
# Add a subheader in the sidebar with the label "Visualisation Selector"
st.sidebar.subheader("Visualisation Selector")

# Add a multiselect in the sidebar with label 'Select the Charts/Plots:'
# Store the current value of this widget in a variable 'plot_list'.
plot_list = st.sidebar.multiselect("Select the Charts/Plots:",
                                            ('Distribution of Records for Income-group', 'Distribution of Records for Gender', 'Difference in Values for Weekly Hours for Income Group','Difference in Values for Weekly Hours for Gender','Number of Records for Workclass of Different Income Groups'))

# Display pie plot using matplotlib module and 'st.pyplot()'
if 'Distribution of Records for Income-group' in plot_list:
  st.subheader('Pie Charts')
  slices = census_df['income'].value_counts()
  plt.figure(figsize = (10,5))
  plt.title('Pie chart for values in Income-Group')
  plt.pie(slices, labels = slices.index)
  st.pyplot()

if 'Distribution of Records for Gender' in plot_list:
  st.subheader('Pie Charts')
  slices = census_df['gender'].value_counts()
  plt.figure(figsize = (10,5))
  plt.title('Pie chart for values in Gender')
  plt.pie(slices, labels = slices.index)
  st.pyplot()

# Display box plot using seaborn module and 'st.pyplot()'

if 'Difference in Values for Weekly Hours for Income Group' in plot_list:
  st.subheader('Box Plots')
  plt.figure(figsize = (10,5))
  plt.title('Box plot for Weekly Hours vs Income')
  sns.boxplot(data = census_df, x = 'hours-per-week', y= 'income')
  st.pyplot()

# Show the gender box plot only when its own menu option is selected.
if 'Difference in Values for Weekly Hours for Gender' in plot_list:
  st.subheader('Box Plots')
  plt.figure(figsize = (10,5))
  plt.title('Box plot for Weekly Hours vs Gender')
  sns.boxplot(data = census_df, x = 'hours-per-week', y= 'gender')
  st.pyplot()

# Display count plot using seaborn module and 'st.pyplot()'

if 'Number of Records for Workclass of Different Income Groups' in plot_list:
  st.subheader('Count Plot')
  plt.figure(figsize = (10,5))
  plt.title('Count plot for WorkClass')
  sns.countplot(data = census_df, x = 'workclass', hue = 'income')
  st.pyplot()



**Note:** Perform the tasks in the `census_app.py` Python file in the **Sublime editor** and run your code from the command prompt or terminal. Once you get the desired output, copy the code into the code cell given above.

**Hint:** The app should look something like this:
<center><img src= 'https://s3-whjr-v2-prod-bucket.whjr.online/ca325ae8-116a-4f0b-98ec-7c028ef1525c.gif' > </center>



---

**Questions:**

**Q:** Which chart/plot can be used to display the distribution of records for the `income-group` feature?

**A:** Pie Chart

**Q:** Which chart/plot can be used to display the distribution of records for the `gender` feature?

**A:** Pie Chart

**Q:** Which chart/plot can be used to display the difference in the range of values for the `hours-per-week` feature for different income groups?

**A:** Box Plot

**Q:** Which chart/plot can be used to display the difference in the range of values for the `hours-per-week` feature for different gender groups?

**A:** Box Plot

**Q:** Which chart/plot can be used to display the number of records for each unique `workclass` feature value for different income groups?

**A:** Count Plot


**After this activity, the Visualization web app should be ready to showcase. Please upload the required files to GitHub and submit the link to the GitHub repository.**

---

**Write your interpretation of the results of the charts/plots here.**

- Interpretation 1: A majority of the people in the census earn less than 50k (about 3/4 appear to earn less than 50k).

- Interpretation 2: A majority of the people in the census are male (around 2/3 appear to be male).

- Interpretation 3: Those who earn more than 50k tend to work more hours than those who earn less than 50k (the median for the below-50k group is around 40 hours per week, and the median for the above-50k group is around 45 hours per week, but the maximum values differ a lot).

- Interpretation 4: The males in the census work more hours on average than the females in the census.

- Interpretation 5: A lot of people work in the private sector (the `Private` workclass).
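
Reading proportions and medians off a chart is approximate, so the same quantities can also be computed directly. The following is a minimal sketch on a made-up six-row DataFrame named `sample_df` (a hypothetical stand-in for `census_df`). The numbers it prints say nothing about the actual census data.

```python
import pandas as pd

# Made-up sample with the same column names as census_df.
sample_df = pd.DataFrame({
    'income': ['<=50K', '<=50K', '<=50K', '<=50K', '>50K', '>50K'],
    'gender': ['Male', 'Female', 'Male', 'Male', 'Male', 'Female'],
    'workclass': ['Private', 'Private', 'State-gov', 'Private',
                  'Private', 'Self-emp-inc'],
    'hours-per-week': [40, 35, 40, 38, 50, 45],
})

# Share of records in each income group and gender (what the pie charts show).
income_share = sample_df['income'].value_counts(normalize=True)
gender_share = sample_df['gender'].value_counts(normalize=True)

# Median weekly hours per group (the centre line of each box plot).
median_hours_by_income = sample_df.groupby('income')['hours-per-week'].median()
median_hours_by_gender = sample_df.groupby('gender')['hours-per-week'].median()

# Record counts per workclass and income group (what the count plot shows).
workclass_counts = pd.crosstab(sample_df['workclass'], sample_df['income'])

print(income_share, median_hours_by_income, workclass_counts, sep='\n\n')
```

Replacing `sample_df` with `census_df` would give the exact shares and medians to quote in the interpretations.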




---

### Submitting the Project:

1. After finishing the project, click on the **Share** button on the top right corner of the notebook. A new dialog box will appear.

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/2_share_button.png' width=500>

2. In the dialog box, make sure that '**Anyone on the Internet with this link can view**' option is selected and then click on the **Copy link** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/3_copy_link.png' width=500>

3. The link of the duplicate copy (named as **YYYY-MM-DD_StudentName_Project95**) of the notebook will get copied.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/4_copy_link_confirmation.png' width=500>

4. Go to your dashboard and click on the **My Projects** option.
   
   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/5_student_dashboard.png' width=800>

  <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/6_my_projects.png' width=800>

5. Click on the **View Project** button for the project you want to submit.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/7_view_project.png' width=800>

6. Click on the **Submit Project Here** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/8_submit_project.png' width=800>

7. Paste the link to the project file named as **YYYY-MM-DD_StudentName_Project95** in the URL box and then click on the **Submit** button.

   <img src='https://student-datasets-bucket.s3.ap-south-1.amazonaws.com/images/project-share-images/9_enter_project_url.png' width=800>

---