# Project 1: Part 3 - Data Analysis and Visualization
## Shreya Kamath
## 7/22/2025
### Answering my data study questions using Matplotlib and survey data cleaned with Pandas

### Data study questions to answer:
#### 1) Of the five courses surveyed, which course(s) resulted in the highest amounts of interest in taking other computing courses?
#### 2) Are older (>=25) students more likely to be enrolled in a degree-granting program, or a certification of achievement?
#### 3) Are women more likely to enroll in CMP 128 or CMP 131?
#### 4) What are the most common ways students learned about the County College of Morris?

### Step 1 - Installing and importing the necessary libraries

In [None]:
!pip install matplotlib

In [None]:
import matplotlib.pyplot as plt

In [None]:
!pip install pandas

In [None]:
import pandas as pd

### Step 2 - Reading the .csv file of cleaned data into a dataframe

In [None]:
df = pd.read_csv('cleaned_survey.csv')

### Question 1: Of the five courses surveyed, which course(s) resulted in the highest amounts of interest in taking other computing courses?

### Step 1a - Creating arrays with the interest levels in future computing courses generated by each course

#### Note - Each array has a length of 5, one entry per possible rating. E.g., the value in the first position of the array is the number of '1 - Not Interested' responses.

In [None]:
count_120 = []
for i in range(1, 6):
    #Explanation: Of all the records in which CMP 120 was selected, count the number of instances of the 'i' number in the interest level column
    ct = df[df['course'] == 'CMP 120 Foundations of Information Security']['interest'].value_counts().get(i, 0)
    count_120.append(int(ct))

In [None]:
count_128 = []
for i in range(1, 6):
    #Explanation: Of all the records in which CMP 128 was selected, count the number of instances of the 'i' number in the interest level column
    ct = df[df['course'] == 'CMP 128 Computer Science I']['interest'].value_counts().get(i, 0)
    count_128.append(int(ct))

In [None]:
count_130 = []
for i in range(1, 6):
    #Explanation: Of all the records in which CMP 130 was selected, count the number of instances of the 'i' number in the interest level column
    ct = df[df['course'] == 'CMP 130 Intro to IT']['interest'].value_counts().get(i, 0)
    count_130.append(int(ct))

In [None]:
count_131 = []
for i in range(1, 6):
    #Explanation: Of all the records in which CMP 131 was selected, count the number of instances of the 'i' number in the interest level column
    ct = df[df['course'] == 'CMP 131 Fundamentals of Programming (Python)']['interest'].value_counts().get(i, 0)
    count_131.append(int(ct))

In [None]:
count_239 = []
for i in range(1, 6):
    #Explanation: Of all the records in which CMP 239 was selected, count the number of instances of the 'i' number in the interest level column
    ct = df[df['course'] == 'CMP 239 Internet & Web Page Design']['interest'].value_counts().get(i, 0)
    count_239.append(int(ct))

#### Source: I used a Kaggle tutorial and a Vultr tutorial to learn the .value_counts() and .get() methods, which let me count the instances of each rating number per course.
#### I then prompted ChatGPT to explain how I could use these methods together, and learned that .get() takes the rating as the key and 0 as the default value to return when no response has that rating, which made the counting process easier.
#### Using these methods in a for loop and appending each count to the array was my own idea, though!
#### Links: https://docs.vultr.com/python/third-party/pandas/Series/get, https://www.kaggle.com/code/parulpandey/five-ways-to-use-value-counts
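The per-course loops can also be collapsed into one table. The following is a minimal sketch on made-up data, not the survey itself: `toy`, `counts`, `count_128_toy`, and `count_130_toy` are hypothetical names, and only the `course` and `interest` column names are taken from the notebook. It assumes `interest` holds integers from 1 to 5.

```python
import pandas as pd

# Made-up stand-in for the cleaned survey; only 'course' and 'interest' are used.
toy = pd.DataFrame({
    'course': ['CMP 128 Computer Science I', 'CMP 128 Computer Science I',
               'CMP 130 Intro to IT', 'CMP 130 Intro to IT', 'CMP 130 Intro to IT'],
    'interest': [5, 3, 1, 5, 5],
})

# Rows are courses, columns are ratings. reindex makes sure all five ratings
# appear as columns, filling ratings nobody chose with 0.
counts = pd.crosstab(toy['course'], toy['interest']).reindex(columns=range(1, 6), fill_value=0)

# Each row converts to the same 5-element list the loops build.
count_128_toy = counts.loc['CMP 128 Computer Science I'].tolist()
count_130_toy = counts.loc['CMP 130 Intro to IT'].tolist()
print(count_128_toy)
print(count_130_toy)
```

The `reindex(..., fill_value=0)` call plays the same role as the `0` default in `.get(i, 0)`: a rating with no responses still gets a slot in the list.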

### Step 2a - Creating an array filled with labels for my pie charts

In [None]:
labels = ['1 - Not Interested', '2 - Slightly Interested', '3 - Neutral', '4 - Interested', '5 - Extremely Interested']

### Step 3a - Creating one figure with pie charts illustrating an interest-level breakdown for each course surveyed

In [None]:
plt.figure(figsize=(20, 13))

#Figure 1
plt.subplot(2, 3, 1)
plt.pie(count_120, autopct='%1.1f%%')
plt.legend(labels)
plt.title('CMP 120: Interest in Enrolling in More Computing Classes')

#Figure 2
plt.subplot(2, 3, 2)
plt.pie(count_128, autopct='%1.1f%%')
plt.legend(labels)
plt.title('CMP 128: Interest in Enrolling in More Computing Classes')

#Figure 3
plt.subplot(2, 3, 3)
plt.pie(count_130, autopct='%1.1f%%')
plt.legend(labels)
plt.title('CMP 130: Interest in Enrolling in More Computing Classes')

#Figure 4
plt.subplot(2, 3, 4)
plt.pie(count_131, autopct='%1.1f%%')
plt.legend(labels)
plt.title('CMP 131: Interest in Enrolling in More Computing Classes')

#Figure 5
plt.subplot(2, 3, 5)
plt.pie(count_239, autopct='%1.1f%%')
plt.legend(labels)
plt.title('CMP 239: Interest in Enrolling in More Computing Classes')
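Because the five subplot blocks differ only in the course and its counts, the same figure could be drawn with a loop. This is a rough sketch with placeholder numbers, not survey results: `example_counts` is a hypothetical dictionary, and only two courses are shown to keep it short.

```python
import matplotlib.pyplot as plt

labels = ['1 - Not Interested', '2 - Slightly Interested', '3 - Neutral',
          '4 - Interested', '5 - Extremely Interested']

# Placeholder counts for illustration only (not taken from the survey).
example_counts = {
    'CMP 120': [1, 2, 3, 4, 5],
    'CMP 128': [2, 1, 4, 3, 6],
}

# One row of subplots, one pie chart per course.
fig, axes = plt.subplots(1, len(example_counts), figsize=(12, 5))
for ax, (course, counts) in zip(axes, example_counts.items()):
    ax.pie(counts, autopct='%1.1f%%')
    ax.legend(labels, loc='upper left', fontsize='small')
    ax.set_title(f'{course}: Interest in Enrolling in More Computing Classes')
```

With the real data, the dictionary would map each course to its count array (for example `'CMP 120': count_120`), and the grid would be sized for five charts.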

#### Results of Question 1: CMP 131 is the course that resulted in the highest levels of interest in future computing courses.
#### Based on these results, the CCM Computing Department should review the survey responses and address the reasons why students were not interested in taking future computing classes after CMP 120 and CMP 130, as these classes had the lowest shares of '5 - Extremely Interested' responses

### Question 2: Are older (>=25) students more likely to be enrolled in a degree-granting program, or a certification of achievement?

### Part 1b - Filtering the data to only show surveys submitted by students aged 25 and older

In [None]:
age_filter = df[(df['age'] == '25-34') | (df['age'] == '35-64') | (df['age'] == '65+')]

### Part 2b - Counting the number of students who are enrolled in a listed certification of achievement program

In [None]:
#List the certificate programs once, then count the rows whose major is in that list
certificate_majors = ['Information Security Certificate of Achievement', 'Web Development Certificate of Achievement', 'Data Analytics Certificate of Achievement']
age_certificate = int(age_filter['major'].isin(certificate_majors).sum())

### Part 3b - Counting the number of students who are enrolled in a degree-granting program

In [None]:
#Count the rows whose major is NOT a certificate or non-degree program (joining != tests with | would be True for every row)
non_degree_majors = ['Information Security Certificate of Achievement', 'Web Development Certificate of Achievement', 'Data Analytics Certificate of Achievement', 'Challenger Program', 'ShareTime CSIP Program', 'Non_Degree Seeking']
age_degree = int((~age_filter['major'].isin(non_degree_majors)).sum())
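Excluding several categories at once is easy to get wrong, so here is a small sketch on made-up data. The two degree major names and the variable names are hypothetical. An OR of "not equal" tests is true for every row, because no value can equal all of the listed strings at once. A negated `isin` (or `!=` tests joined with `&`) counts only the rows outside the list.

```python
import pandas as pd

# Made-up majors: two hypothetical degree programs and two excluded programs.
majors = pd.Series(['Computer Science', 'Web Development Certificate of Achievement',
                    'Non_Degree Seeking', 'Information Technology'])
excluded = ['Web Development Certificate of Achievement', 'Non_Degree Seeking']

# OR of two "not equal" tests: true for every row, so it counts everything.
or_count = int(((majors != excluded[0]) | (majors != excluded[1])).sum())

# Negated membership test: true only for rows outside the excluded list.
isin_count = int((~majors.isin(excluded)).sum())

print(or_count, isin_count)
```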

### Part 4b - Creating a bar graph to compare the two results

In [None]:
plt.figure(figsize = (10, 5))
plt.bar(['Certification', 'Degree'], [age_certificate, age_degree])
plt.title('Enrollment of Students Aged 25+ in Degree vs. Certificate Programs')
plt.xlabel('Program')
plt.ylabel('Count')

#### Results of Question 2: Students aged 25 and older are more likely to be enrolled in a degree program than in a certification program
#### Based on these results, the CCM Computing Department should work harder to market certification programs to students aged 25+, as these individuals are more likely to be in the workforce and looking for a lower-commitment program for developing specific skills than a full-blown degree
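Since "more likely" is a statement about proportions, raw counts can be turned into shares as a cross-check. This sketch uses made-up rows: the `program_type` and `age_group` columns and the `'18-24'` age label are assumptions for illustration and do not come from the cleaned survey.

```python
import pandas as pd

# Made-up rows; 'program_type' is a hypothetical column and '18-24' a guessed label.
toy = pd.DataFrame({
    'age': ['25-34', '25-34', '35-64', '18-24', '18-24', '18-24', '18-24'],
    'program_type': ['Degree', 'Certificate', 'Degree',
                     'Degree', 'Degree', 'Degree', 'Certificate'],
})

# Label each row as 25+ or under 25, using the same age buckets as the filter above.
toy['age_group'] = toy['age'].isin(['25-34', '35-64', '65+']).map({True: '25+', False: 'Under 25'})

# normalize='index' turns each row of the table into shares that sum to 1.
shares = pd.crosstab(toy['age_group'], toy['program_type'], normalize='index')
print(shares)
```

Comparing the `25+` row with the `Under 25` row shows whether the degree-versus-certificate split actually differs by age, rather than reflecting how many students each program has overall.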

### Question 3: Are women more likely to enroll in CMP 128 or CMP 131?

### Step 1c - Filtering the data to only show results of surveys submitted by women

In [None]:
female_data = df[df['gender'] == 'Woman']

### Step 2c - Counting the number of women in each course

In [None]:
women_128 = int((female_data['course'] == 'CMP 128 Computer Science I').sum())

In [None]:
women_131 = int((female_data['course'] == 'CMP 131 Fundamentals of Programming (Python)').sum())

In [None]:
women_130 = int((female_data['course'] == 'CMP 130 Intro to IT').sum())

In [None]:
women_120 = int((female_data['course'] == 'CMP 120 Foundations of Information Security').sum())

In [None]:
women_239 = int((female_data['course'] == 'CMP 239 Internet & Web Page Design').sum())
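The five counts can also come from a single `value_counts()` call on the filtered column. This is a sketch on made-up data: `toy`, `courses`, and `women_by_course` are hypothetical names, and the `'Man'` label is assumed for illustration.

```python
import pandas as pd

# Made-up rows; 'Man' is an assumed label for the non-matching rows.
toy = pd.DataFrame({
    'gender': ['Woman', 'Woman', 'Man', 'Woman'],
    'course': ['CMP 128 Computer Science I', 'CMP 128 Computer Science I',
               'CMP 128 Computer Science I', 'CMP 130 Intro to IT'],
})
courses = ['CMP 128 Computer Science I', 'CMP 130 Intro to IT',
           'CMP 131 Fundamentals of Programming (Python)']

# Keep only the women's rows, count each course, and force a fixed course order.
# Courses with no matching rows get 0 instead of being dropped.
women_by_course = (
    toy.loc[toy['gender'] == 'Woman', 'course']
       .value_counts()
       .reindex(courses, fill_value=0)
)
print(women_by_course)
```

The resulting Series can be handed to `plt.bar(women_by_course.index, women_by_course.values)` directly, though the full course names are long for axis labels.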

### Step 3c - Creating a bar graph to illustrate the female enrollment by course

In [None]:
plt.bar(['CMP 120','CMP 128','CMP 130', 'CMP 131', 'CMP 239'], [women_120, women_128, women_130, women_131, women_239])
plt.title('Female Enrollment by Course')
plt.xlabel('Course Title')
plt.ylabel('Enrollment')

#### Results of Question 3: Women are more likely to be enrolled in CMP 128 than in CMP 131
#### Based on these results, the CCM Computing Department should work harder to market the Data Science/Analytics programs to women, as these are programs that rely on Python as a fundamental language, instead of the Computer Science program (which relies more on Java)

### Question 4: What are the most common ways students learned about the County College of Morris?

### Step 1d - Count the number of 'Yes' responses to each listed source of information

In [None]:
learn_website = int((df['hear_website'] == 'Yes').sum())

In [None]:
learn_social = int((df['hear_social'] == 'Yes').sum())

In [None]:
learn_community = int((df['hear_community'] == 'Yes').sum())

In [None]:
learn_family = int((df['hear_family'] == 'Yes').sum())

In [None]:
learn_student = int((df['hear_student'] == 'Yes').sum())

In [None]:
learn_alumni = int((df['hear_alumni'] == 'Yes').sum())

In [None]:
learn_teacher = int((df['hear_teacher'] == 'Yes').sum())

In [None]:
learn_counselor = int((df['hear_counselor'] == 'Yes').sum())

In [None]:
learn_app = int((df['hear_app'] == 'Yes').sum())

In [None]:
learn_employer = int((df['hear_employer'] == 'Yes').sum())

In [None]:
learn_billboard = int((df['hear_billboard'] == 'Yes').sum())

In [None]:
learn_tv = int((df['hear_tv'] == 'Yes').sum())

In [None]:
learn_radio = int((df['hear_radio'] == 'Yes').sum())

In [None]:
learn_other = int((df['hear_other'] == 'Yes').sum())
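Since every source column starts with `hear_`, the fourteen counts could be computed in one expression. This is a minimal sketch on made-up data: it assumes the non-'Yes' answers are stored as `'No'`, and `toy` and `hear_counts` are hypothetical names.

```python
import pandas as pd

# Made-up responses; 'No' is an assumed value for unselected sources.
toy = pd.DataFrame({
    'hear_website': ['Yes', 'No', 'Yes'],
    'hear_family': ['Yes', 'Yes', 'Yes'],
    'hear_radio': ['No', 'No', 'No'],
    'course': ['a', 'b', 'c'],  # unrelated column, ignored by the filter
})

# Select the columns whose name contains 'hear_', mark the 'Yes' cells,
# sum them per column, and sort from most to least common.
hear_counts = toy.filter(like='hear_').eq('Yes').sum().sort_values(ascending=False)
print(hear_counts)
```

Sorting the counts before plotting also makes the most common sources easy to read off the bar chart.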

### Step 2d - Create a bar graph to illustrate the most common ways students learn about CCM

In [None]:
plt.figure(figsize=(17, 5)) 
plt.bar(['Website', 'Socials', 'Community', 'Friends/Fam', 'Student', 'Alumni', 'Teacher', 'Counselor', 'App', 'Employer', 'Billboard', 'TV', 'Radio', 'Other'], 
        [learn_website, learn_social, learn_community, learn_family, learn_student, learn_alumni, learn_teacher, learn_counselor, learn_app, learn_employer, learn_billboard, learn_tv, learn_radio, learn_other])
plt.title('Ways Students Learned About County College of Morris')
plt.xlabel('Sources')
plt.ylabel('Count')

#### Results of Question 4: The most common way for students to learn about CCM is through their friends and family, with other major sources of information being their high schools, the school website, and current students
#### Based on this information, CCM could put less effort into marketing through traditional avenues (app, TV, radio advertisements, etc.), as most of its advertising comes for free through word of mouth