Problem 1: Pandas Data Frames (3 Points)

Write a Python script that reads adult.data into a data frame, then prints the number of rows in the data frame.

Notice that this file does not have column headers. You will need to specify the header=None option inside the read_csv() function.

To find the number of rows, you can use the count() function. However, called on the whole data frame, count() returns the number of non-missing values in each column separately. To get a single number, select the first column with df[0] and call count() on that column alone: df[0].count()

In [1]:
import pandas as pd

# Read the data into a DataFrame with header=None
df = pd.read_csv('adult.data', header=None)

# Print the number of rows in the DataFrame
rowCount = df[0].count()
print(f"Number of rows in the DataFrame: {rowCount}")

Number of rows in the DataFrame: 32561
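
Because count() only counts non-missing values, df[0].count() equals the row count only when column 0 has no gaps. The sketch below uses a made-up three-row sample (not adult.data) and hypothetical variable names to compare it with len(df) and df.shape[0], which count every row regardless of missing values.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the second row has no value in column 0.
sample = io.StringIO("39,State-gov\n,Private\n50,Self-emp\n")
df_sample = pd.read_csv(sample, header=None)

rows_by_len = len(df_sample)             # counts every row
rows_by_shape = df_sample.shape[0]       # same number, read from the shape tuple
non_null_in_col0 = df_sample[0].count()  # skips the missing value in column 0

print(rows_by_len, rows_by_shape, non_null_in_col0)
```

On this sample the first two values agree and the third is one lower, so len(df) is the safer choice when a column might contain missing entries.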


Problem 2: Simple Dataset File Processing (3 Points)

Write a Python script that reads the sample Google Play Store apps data file into a data frame. Notice that this file has column headers.

Print the data frame using the print function.

Notice that the second column holds the category of the app. An app category can be Business, Education, etc. 
Find the total number of apps for each category. This can be done by extracting the 'Category' column and counting the number of occurrences of each category using the value_counts() function.
Which category has the highest number of apps? Write code that prints that category, and the number of apps associated with it. Look into these functions: max(), idxmax().
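
As a minimal sketch of how value_counts(), idxmax(), and max() fit together, the example below uses a small made-up Series instead of the real 'Category' column; the data and variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical category values standing in for the real 'Category' column.
categories = pd.Series(
    ["BUSINESS", "ART_AND_DESIGN", "BUSINESS", "EDUCATION", "BUSINESS"]
)

counts = categories.value_counts()  # occurrences per category, largest first
top_category = counts.idxmax()      # index label that holds the largest count
top_count = counts.max()            # the largest count itself

print(f"{top_category}: {top_count} apps")
```

idxmax() returns the label (here the category name), while max() returns the value stored under that label.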

Write a piece of code that finds for each non numeric column, the distinct values that appear in this column, and the frequency of occurrence of each value. For example, for the category column, the code should print all the categories listed, and the number of times each category appears in this column.
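
One possible way to find the non-numeric columns without listing them by hand is select_dtypes(). The sketch below runs on a tiny made-up frame whose column names only mirror the assignment's file; it is an illustration under that assumption, not the required solution.

```python
import pandas as pd

# Hypothetical miniature frame; the real file has more columns and rows.
apps = pd.DataFrame({
    "App": ["A", "B", "C"],
    "Category": ["BUSINESS", "BUSINESS", "EDUCATION"],
    "Rating": [4.1, 3.9, 4.7],
})

# Keep only the columns pandas does not treat as numeric.
non_numeric_cols = apps.select_dtypes(exclude="number").columns

# Distinct values and their frequencies, one Series per non-numeric column.
frequencies = {col: apps[col].value_counts() for col in non_numeric_cols}
for col, freq in frequencies.items():
    print(f"\n{col}:\n{freq}")
```

Note that a column holding numbers as text (for example values with a trailing "M" or "+") is read as non-numeric and would be included by this selection.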

In [2]:
df = pd.read_csv('sample_googleplaystore.csv', header=0)
print(df)

categoryCount = df['Category'].value_counts()
print('\n', categoryCount)

print(f'\nHighest: {categoryCount.idxmax()} category has {categoryCount.max()} apps.')

for col in ['App', 'Category', 'Type', 'Content Rating', 'Genres']:
    print('\n', df[col].value_counts())


                                                   App        Category  \
0       Photo Editor & Candy Camera & Grid & ScrapBook  ART_AND_DESIGN   
1                                  Coloring book moana  ART_AND_DESIGN   
2    U Launcher Lite – FREE Live Cool Themes, Hide ...  ART_AND_DESIGN   
3                                Sketch - Draw & Paint  ART_AND_DESIGN   
4                Pixel Draw - Number Art Coloring Book  ART_AND_DESIGN   
..                                                 ...             ...   
234      TurboScan: scan documents and receipts in PDF        BUSINESS   
235                     Tiny Scanner Pro: PDF Doc Scan        BUSINESS   
236                                                Box        BUSINESS   
237                                           Zenefits        BUSINESS   
238                                         Google Ads        BUSINESS   

     Rating  Reviews                Size     Installs  Type   Price  \
0       4.1      159                 19M

Problem 3: Data Slicing and Computing Stats (8 Points)

Write a script that reads a US Census file into a data frame and:
1. Finds the minimum, maximum, and average age.
2. Finds the number of persons with a Master's degree making more than 50K vs. the number making 50K or less.
3. Finds the percentage of persons with a Master's degree making more than 50K vs. the percentage making 50K or less.
4. Finds, for each education level, the percentage of people making more than 50K vs. the percentage making 50K or less.
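
Items 2 and 3 come down to combining boolean masks. The sketch below uses four made-up rows in the same "comma followed by a space" style as the census file, reduced to three hypothetical columns (age, education, income). It assumes skipinitialspace=True is acceptable for the task; that option strips the space after each delimiter, so values can be compared without a leading blank.

```python
import io
import pandas as pd

# Hypothetical rows: age, education, income (not the real 15-column layout).
raw = io.StringIO(
    "39, Masters, >50K\n"
    "50, Masters, <=50K\n"
    "38, HS-grad, <=50K\n"
    "45, Masters, >50K\n"
)
people = pd.read_csv(raw, header=None, skipinitialspace=True)

is_masters = people[1] == "Masters"  # no leading space needed after stripping
high_income = people[2] == ">50K"

# Summing a boolean mask counts the rows where it is True.
masters_high = int((is_masters & high_income).sum())
masters_low = int((is_masters & ~high_income).sum())
pct_high = 100 * masters_high / is_masters.sum()

print(masters_high, masters_low, f"{pct_high:.2f}%")
```

Without skipinitialspace=True the values keep their leading space, and the comparisons would have to use " Masters" and " >50K" instead.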

Repeat item 4 above (only item 4) for each type of occupation, which is column 1. Note that in the standard UCI Adult layout, column 1 holds the work class.
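
The per-group percentages in item 4 can also be computed in one step with pd.crosstab(). The sketch below uses a made-up frame with named columns (the real file is addressed by integer position), so the names and data are assumptions for illustration.

```python
import pandas as pd

# Hypothetical frame; "education" and "income" stand in for columns 3 and 14.
people = pd.DataFrame({
    "education": ["Masters", "Masters", "HS-grad", "HS-grad", "HS-grad", "Masters"],
    "income":    [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# normalize="index" divides each row by its own total, giving per-group shares.
shares = pd.crosstab(people["education"], people["income"], normalize="index") * 100
print(shares.round(2))
```

Swapping the first argument for another grouping column gives the same breakdown for that column, which avoids repeating the loop for each grouping.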

In [63]:
df = pd.read_csv('adult.data', header=None)

# Task-1
print(f'Minimum Age: {df[0].min()}')
print(f'Maximum Age: {df[0].max()}')
print(f'Average Age: {df[0].mean()}')


# Task-2
# Filter the DataFrame for persons with Master's degree
masters_above_50k = df[(df[3] == " Masters") & (df[14] == " >50K")].shape[0]
masters_below_50k = df[(df[3] == " Masters") & (df[14] == " <=50K")].shape[0]
print(f"\nNumber of People with Master's making >50k: {masters_above_50k}")
print(f"Number of People with Master's making <=50k: {masters_below_50k}")


# Task-3
masters_count = df[df[3] == " Masters"].shape[0]
percentage_masters_above_50k = (masters_above_50k / masters_count) * 100
percentage_masters_below_50k = (masters_below_50k / masters_count) * 100
print(f"\nPercentage of People with Master's making >50k: {percentage_masters_above_50k:.2f}%")
print(f"Percentage of People with Master's making <=50k: {percentage_masters_below_50k:.2f}%\n\n")


# Task-4
education_levels = df[3].unique()
for edu_level in education_levels:
    above_50k = df[(df[3] == edu_level) & (df[14] == " >50K")].shape[0]
    below_50k = df[(df[3] == edu_level) & (df[14] == " <=50K")].shape[0]
    
    total = above_50k + below_50k
    percentage_above_50k = (above_50k / total) * 100 if total != 0 else 0
    percentage_below_50k = (below_50k / total) * 100 if total != 0 else 0
    
    print(f"{edu_level}:")
    print(f"  - Percentage making more than 50K: {percentage_above_50k:.2f}%")
    print(f"  - Percentage making less than or equal to 50K: {percentage_below_50k:.2f}%")
    print()


# Task-5
occupations = df[1].unique()
for occupation in occupations:
    above_50k = df[(df[1] == occupation) & (df[14] == " >50K")].shape[0]
    below_50k = df[(df[1] == occupation) & (df[14] == " <=50K")].shape[0]
    
    total = above_50k + below_50k
    percentage_above_50k = (above_50k / total) * 100 if total != 0 else 0
    percentage_below_50k = (below_50k / total) * 100 if total != 0 else 0
    
    print(f"{occupation}:")
    print(f"  - Percentage making more than 50K: {percentage_above_50k:.2f}%")
    print(f"  - Percentage making less than or equal to 50K: {percentage_below_50k:.2f}%")
    print()

Minimum Age: 17
Maximum Age: 90
Average Age: 38.58164675532078

Number of People with Master's making >50k: 959
Number of People with Master's making <=50k: 764

Percentage of People with Master's making >50k: 55.66%
Percentage of People with Master's making <=50k: 44.34%


 Bachelors:
  - Percentage making more than 50K: 41.48%
  - Percentage making less than or equal to 50K: 58.52%

 HS-grad:
  - Percentage making more than 50K: 15.95%
  - Percentage making less than or equal to 50K: 84.05%

 11th:
  - Percentage making more than 50K: 5.11%
  - Percentage making less than or equal to 50K: 94.89%

 Masters:
  - Percentage making more than 50K: 55.66%
  - Percentage making less than or equal to 50K: 44.34%

 9th:
  - Percentage making more than 50K: 5.25%
  - Percentage making less than or equal to 50K: 94.75%

 Some-college:
  - Percentage making more than 50K: 19.02%
  - Percentage making less than or equal to 50K: 80.98%

 Assoc-acdm:
  - Percentage making more than 50K: 24.84%
  - 