# Exploring a target variable

You have been asked to build a machine learning model to predict whether or not a person makes over $50,000 in a year. To understand the target variable, Above/Below 50k, you decide to explore the variable in more detail.


* Explore the Above/Below 50k variable by printing out a description of the variable's contents.
* Explore the Above/Below 50k variable by printing out a frequency table of the values found in this column.
* Rerun .value_counts(), but this time print out the relative frequency values instead of the counts.

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

adult = pd.read_csv("/kaggle/input/adult-census-income/adult.csv")
adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


In [2]:
# Explore the Above/Below 50k variable
print(adult["income"].describe())

# Print a frequency table of "Above/Below 50k"
print(adult["income"].value_counts())



count     32561
unique        2
top       <=50K
freq      24720
Name: income, dtype: object
income
<=50K    24720
>50K      7841
Name: count, dtype: int64


In [3]:
# Print relative frequency values
print(adult["income"].value_counts(normalize = True))

income
<=50K    0.75919
>50K     0.24081
Name: proportion, dtype: float64


Well done! Above/Below 50k is a categorical variable with only two categories. Using both the .describe() and .value_counts() methods, you can see that the dataset is imbalanced: about 76% of people fall in the <=50K category and about 24% in >50K.
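To read counts and proportions side by side, the two .value_counts() results can be combined into one table. The sketch below is only an illustration: `toy_income` is a made-up stand-in for the income column, not the adult data.

```python
import pandas as pd

# Hypothetical stand-in for adult["income"]; the values are made up.
toy_income = pd.Series(["<=50K", "<=50K", "<=50K", ">50K"])

# Absolute and relative frequencies of each category
counts = toy_income.value_counts()
shares = toy_income.value_counts(normalize=True)

# Put both in one table, aligned on the category labels
summary = pd.DataFrame({"count": counts, "proportion": shares})
print(summary)
```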

# Setting dtypes and saving memory

A colleague of yours is exploring a list of occupations and how they relate to salary. She has given you a list of these occupations, list_of_occupations, and has a few simple questions such as "How many different titles are there?" and "Which position is the most common?"

* Create a pandas Series, series1, using the list_of_occupations (do not set the dtype).
* Print both the data type and number of bytes used of this new Series.
* Create a second pandas Series, series2, using the list_of_occupations and set the dtype to "category".

In [4]:
list_of_occupations = pd.read_csv("/kaggle/input/occupations-in-restaurants-of-us/Occupations by Share.csv")
list_of_occupations.head()

Unnamed: 0,ID Major Occupation Group,Major Occupation Group,ID Minor Occupation Group,Minor Occupation Group,ID Broad Occupation,Broad Occupation,ID Detailed Occupation,Detailed Occupation,ID Year,Year,...,Workforce Status,Total Population,Total Population MOE Appx,Average Wage,Average Wage Appx MOE,Record Count,Slug Detailed Occupation,PUMS Industry,ID PUMS Industry,Slug PUMS Industry
0,110000-290000,"Management, business, science, & arts occupations",110000-130000,"Management, business, & financial occupations",110000,Management occupations,111021,General & operations managers,2018,2018,...,True,131544,8900.597026,56918.656655,2782.391462,1144,general-operations-managers,Restaurants & Food Services,722Z,restaurants-food-services
1,110000-290000,"Management, business, science, & arts occupations",110000-130000,"Management, business, & financial occupations",110000,Management occupations,1110XX,Chief executives & legislators,2018,2018,...,True,16502,3153.69058,149314.985456,27376.078514,159,chief-executives-legislators,Restaurants & Food Services,722Z,restaurants-food-services
2,110000-290000,"Management, business, science, & arts occupations",110000-130000,"Management, business, & financial occupations",110000,Management occupations,112021,Marketing managers,2018,2018,...,True,8721,2292.689463,72000.831097,26238.474059,91,marketing-managers,Restaurants & Food Services,722Z,restaurants-food-services
3,110000-290000,"Management, business, science, & arts occupations",110000-130000,"Management, business, & financial occupations",110000,Management occupations,112022,Sales managers,2018,2018,...,True,16783,3180.425146,76511.654293,5423.302915,135,sales-managers,Restaurants & Food Services,722Z,restaurants-food-services
4,110000-290000,"Management, business, science, & arts occupations",110000-130000,"Management, business, & financial occupations",110000,Management occupations,113013,Facilities managers,2018,2018,...,True,1882,1065.079032,74134.693411,17802.213705,17,facilities-managers,Restaurants & Food Services,722Z,restaurants-food-services


In [5]:
import pandas as pd
import itertools

# Example list of lists (replace with your actual data)
list_of_occupations = [['Engineer', 'Doctor'], ['Artist', 'Teacher']]

# Flatten the list
flattened_list = list(itertools.chain.from_iterable(list_of_occupations))

# Now create a Pandas series
series1 = pd.Series(flattened_list)

# Print the data type and number of bytes for series1
print("series1 data type:", series1.dtype)
print("series1 number of bytes:", series1.nbytes)



series1 data type: object
series1 number of bytes: 32


In [6]:
import pandas as pd
import itertools

# Example list of lists (replace with your actual data)
list_of_occupations = [['Engineer', 'Doctor'], ['Artist', 'Teacher']]

# Flatten the list
flattened_list = list(itertools.chain.from_iterable(list_of_occupations))

# Create a Pandas series with 'category' dtype
series2 = pd.Series(flattened_list, dtype="category")

# Print out the data type and number of bytes for series2
print("series2 data type:", series2.dtype)
print("series2 number of bytes:", series2.nbytes)


series2 data type: category
series2 number of bytes: 36
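In this four-item example the categorical Series comes out slightly larger (36 bytes versus 32), because every value is unique and the category codes are stored on top of the categories themselves. The savings show up when values repeat. A minimal sketch, assuming a made-up list in which four titles each repeat 1,000 times (`titles` and the other names are illustrative):

```python
import pandas as pd

# Made-up data: four job titles repeated 1,000 times each
titles = ["Engineer", "Doctor", "Artist", "Teacher"] * 1000

as_default = pd.Series(titles)                      # pandas infers the dtype
as_category = pd.Series(titles, dtype="category")  # one small code per row

# deep=True also counts the memory held by the strings themselves
default_bytes = as_default.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print("default dtype bytes: ", default_bytes)
print("category dtype bytes:", category_bytes)
```

The exact byte counts depend on the pandas version and platform, so compare the two numbers rather than relying on specific values.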


**Creating a categorical pandas Series**

Another colleague at work has collected information on the number of "Gold", "Silver", and "Bronze" medals won by the USA at the Summer & Winter Olympics since 1896. She has provided this as a list, medals_won. Before taking a look at the total number of each medal won, you want to create a categorical pandas Series. However, you know that these medals have a specific order to them: Gold is better than Silver, and Silver is better than Bronze. Use the object, medals_won, to help.

* Create a categorical pandas Series without using pd.Series().
* Specify the three known medal categories such that "Bronze" < "Silver" < "Gold".
* Specify that the order of the categories is important when creating this Series.

In [7]:


# Try reading the file with a different encoding
medals_won = pd.read_csv("/kaggle/input/summer-olympics-medals/Summer-Olympic-medals-1976-to-2008.csv", encoding='latin1')

# Display the first few rows
medals_won.head()


Unnamed: 0,City,Year,Sport,Discipline,Event,Athlete,Gender,Country_Code,Country,Event_gender,Medal
0,Montreal,1976.0,Aquatics,Diving,3m springboard,"KÖHLER, Christa",Women,GDR,East Germany,W,Silver
1,Montreal,1976.0,Aquatics,Diving,3m springboard,"KOSENKOV, Aleksandr",Men,URS,Soviet Union,M,Bronze
2,Montreal,1976.0,Aquatics,Diving,3m springboard,"BOGGS, Philip George",Men,USA,United States,M,Gold
3,Montreal,1976.0,Aquatics,Diving,3m springboard,"CAGNOTTO, Giorgio Franco",Men,ITA,Italy,M,Silver
4,Montreal,1976.0,Aquatics,Diving,10m platform,"WILSON, Deborah Keplar",Women,USA,United States,W,Bronze


In [8]:
# Create a categorical Series and specify the categories (let pandas know the order matters!)
categories = ["Bronze", "Silver", "Gold"]
medals = pd.Categorical(medals_won["Medal"], categories=categories, ordered=True)
print(medals)

['Silver', 'Bronze', 'Gold', 'Silver', 'Bronze', ..., 'Bronze', 'Gold', 'Silver', 'Gold', 'Gold']
Length: 15433
Categories (3, object): ['Bronze' < 'Silver' < 'Gold']


**Great work. pd.Categorical() is a great way to create categorical data and specify both the categories and whether or not the order of these categories is important.**
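Because the categories are ordered, comparisons and min/max follow medal rank rather than alphabetical order. The sketch below uses a short made-up list (`sample_medals` is hypothetical, not the Kaggle file) and wraps the Categorical in a Series:

```python
import pandas as pd

# Made-up sample; the real data lives in medals_won["Medal"]
sample_medals = ["Silver", "Bronze", "Gold", "Silver", "Bronze"]

ranked = pd.Series(
    pd.Categorical(
        sample_medals,
        categories=["Bronze", "Silver", "Gold"],
        ordered=True,
    )
)

# Comparisons respect Bronze < Silver < Gold
better_than_bronze = ranked > "Bronze"
print(better_than_bronze.tolist())

# min() and max() also use the category order, not alphabetical order
print(ranked.min(), ranked.max())
```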

**Setting dtype when reading data**

You are preparing to create a machine learning model to predict a person's income category using the adult census income dataset. You don't have access to any cloud resources and you want to make sure that your laptop will be able to load the full dataset and process its contents. You have read in the first five rows of the dataset adult to help you understand what kind of columns are available.

* Call the correct attribute on the adult DataFrame to review the data types.
* Create a dictionary with keys: "Workclass", "Education", "Relationship", and "Above/Below 50k".
* Set the value for each key to be "category".
* Use the newly created dictionary, adult_dtypes, when reading in adult.csv.

In [9]:

# Check the data types of the DataFrame
print(adult.dtypes)

age                int64
workclass         object
fnlwgt             int64
education         object
education.num      int64
marital.status    object
occupation        object
relationship      object
race              object
sex               object
capital.gain       int64
capital.loss       int64
hours.per.week     int64
native.country    object
income            object
dtype: object


In [10]:
# Check the dtypes
print(adult.dtypes)

# Create a dictionary with column names as keys and "category" as values.
# Keys must match the column names in this CSV exactly; the exercise's
# "Above/Below 50k" column is called "income" in this file.
adult_dtypes = {
  "workclass": "category",
  "education": "category",
  "relationship": "category",
  "income": "category"
}

age                int64
workclass         object
fnlwgt             int64
education         object
education.num      int64
marital.status    object
occupation        object
relationship      object
race              object
sex               object
capital.gain       int64
capital.loss       int64
hours.per.week     int64
native.country    object
income            object
dtype: object


In [11]:

# Read in the CSV using the dtype parameter
adult2 = pd.read_csv(
  "/kaggle/input/adult-census-income/adult.csv",
  dtype = adult_dtypes
)
print(adult2.dtypes)

age                  int64
workclass         category
fnlwgt               int64
education         category
education.num        int64
marital.status      object
occupation          object
relationship      category
race                object
sex                 object
capital.gain         int64
capital.loss         int64
hours.per.week       int64
native.country      object
income            category
dtype: object
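The keys of the dtype dictionary need to be spelled exactly like the column names in the file. A minimal sketch of the same pattern, assuming a tiny in-memory CSV in place of adult.csv (`csv_text`, `toy_dtypes`, and `toy` are made up for illustration):

```python
import io
import pandas as pd

# Made-up CSV text standing in for adult.csv
csv_text = """age,workclass,income
39,Private,<=50K
50,Self-emp,<=50K
38,Private,>50K
"""

# Keys are spelled exactly like the header row of the file
toy_dtypes = {"workclass": "category", "income": "category"}

toy = pd.read_csv(io.StringIO(csv_text), dtype=toy_dtypes)
print(toy.dtypes)
```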


# Setting up a .groupby() statement

The gender wage gap is a hot-topic item in the United States and across the world. Using the adult census income dataset, loaded as adult, you want to check if some of the recently published data lines up with this income survey.

* Split the adult dataset across the "Sex" and "Above/Below 50k" columns, saving this object as gb.
* Print out the number of observations found in each group.
* Using gb, find the average of each numerical column.

In [13]:
adult.head()

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


In [22]:
adult1 = adult.select_dtypes(include=['float', 'int']).copy()
adult1['income'] = adult['income'] 
adult1["sex"] = adult["sex"]
adult1.head()

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week,income,sex
0,90,77053,9,0,4356,40,<=50K,Female
1,82,132870,9,0,4356,18,<=50K,Female
2,66,186061,10,0,4356,40,<=50K,Female
3,54,140359,4,0,3900,40,<=50K,Female
4,41,264663,10,0,3900,40,<=50K,Female


In [23]:
# Group the adult dataset by "Sex" and "Above/Below 50k"
gb = adult1.groupby(by=["sex", "income"])

# Print out how many rows are in each created group
print(gb.size())

# Print out the mean of each group for all columns
print(gb.mean())

sex     income
Female  <=50K      9592
        >50K       1179
Male    <=50K     15128
        >50K       6662
dtype: int64
                     age         fnlwgt  education.num  capital.gain  \
sex    income                                                          
Female <=50K   36.210801  185999.381359       9.820475    121.986134   
       >50K    42.125530  183687.406277      11.787108   4200.389313   
Male   <=50K   37.147012  193093.609268       9.452142    165.723823   
       >50K    44.625788  188769.101321      11.580606   3971.765836   

               capital.loss  hours.per.week  
sex    income                                
Female <=50K      47.364470       35.916701  
       >50K      173.648855       40.426633  
Male   <=50K      56.806782       40.693879  
       >50K      198.780396       46.366106  


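Excellent! The proportion of women making more than 50k is a lot lower than that of men: about 11% of women (1,179 of 10,771) versus about 31% of men (6,662 of 21,790). However, women making more than 50k are on average younger than their male counterparts (about 42.1 versus 44.6 years).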
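The proportion comparison can be computed from the group sizes by dividing each count by the total for its sex. A sketch using the counts from the gb.size() output above, typed in by hand (the names `sizes` and `share` are illustrative):

```python
import pandas as pd

# Group sizes copied from the gb.size() output above
index = pd.MultiIndex.from_tuples(
    [("Female", "<=50K"), ("Female", ">50K"),
     ("Male", "<=50K"), ("Male", ">50K")],
    names=["sex", "income"],
)
sizes = pd.Series([9592, 1179, 15128, 6662], index=index)

# Divide each group's count by the total count for that sex
share = sizes / sizes.groupby(level="sex").transform("sum")
print(share.round(3))
```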

# Using pandas functions effectively

You are creating a Python application that will calculate summary statistics based on user-selected variables. The complete dataset is quite large. For now, you are setting up your code using part of the dataset, preloaded as adult. As you create a reusable process, make sure you are thinking through the most efficient way to set up the GroupBy object.

* Create a list of the names for two user-selected variables: "Education" and "Above/Below 50k".
* Create a GroupBy object, gb, using the user_list as the grouping variables.
* Calculate the mean of "Hours/Week" across each group using the most efficient approach covered in the video.

In [24]:
# Create a list of user-selected variables
user_list = ["education", "income"]
# Create a GroupBy object using this list
gb = adult.groupby(by = user_list)

# Find the mean for the variable "Hours/Week" for each group - Be efficient!
print(gb["hours.per.week"].mean())

education     income
10th          <=50K     36.574053
              >50K      43.774194
11th          <=50K     33.322870
              >50K      45.133333
12th          <=50K     35.035000
              >50K      44.818182
1st-4th       <=50K     37.864198
              >50K      48.833333
5th-6th       <=50K     38.539432
              >50K      46.000000
7th-8th       <=50K     38.830033
              >50K      47.500000
9th           <=50K     37.667351
              >50K      44.851852
Assoc-acdm    <=50K     39.264339
              >50K      44.256604
Assoc-voc     <=50K     40.817826
              >50K      43.853186
Bachelors     <=50K     40.586152
              >50K      45.475462
Doctorate     <=50K     45.429907
              >50K      47.513072
HS-grad       <=50K     39.727510
              >50K      45.042985
Masters       <=50K     41.223822
              >50K      45.917623
Preschool     <=50K     36.647059
Prof-school   <=50K     42.816993
              >50K      49.

**People earning more than $50,000 tend to work a lot more hours, regardless of their education, than people earning less than $50,000. Remember, it's important to select your variables before calling a function. Large datasets might have problems calculating the mean of every numerical column.**
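Selecting the column on the GroupBy object before calling .mean() means only that column is aggregated. The sketch below, on a small made-up DataFrame (`toy` is hypothetical), compares this with aggregating every numeric column first and picking the column afterwards:

```python
import pandas as pd

# Made-up data with the same column names as the adult dataset
toy = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "Bachelors", "Bachelors"],
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
    "hours.per.week": [38, 44, 40, 46],
    "age": [30, 45, 28, 41],
})

gb = toy.groupby(by=["education", "income"])

# Efficient: pick the column first, then aggregate only that column
selected_first = gb["hours.per.week"].mean()

# Less efficient: aggregate every numeric column, then pick one
selected_last = gb.mean(numeric_only=True)["hours.per.week"]

# Both routes give the same values for the selected column
print(selected_first.equals(selected_last))
```

The results match; the difference is how much work is done to get there, which matters more as the number of numeric columns grows.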

