# DATA SCIENCE FOUNDATIONS II

## Census Variables

You have decided to volunteer for your local community by offering to clean their recently collected census data. The description of this dataset is as follows:

| column         | description                                                                                                           |
| -------------- | --------------------------------------------------------------------------------------------------------------------- |
| first_name     | The respondent's first name.                                                                                          |
| last_name      | The respondent's last name.                                                                                           |
| birth_year     | The respondent's year of birth.                                                                                       |
| voted          | If the respondent participated in the current voting cycle.                                                           |
| num_children   | The number of children the respondent has.                                                                            |
| income_year    | The average yearly income the respondent earns.                                                                       |
| higher_tax     | The respondent's answer to the question: Rate your agreement with the statement: the wealthy should pay higher taxes. |
| marital_status | The respondent's current marital status.                                                                              |

### Assessing Variable Types

1. The census dataframe is composed of simulated census data to represent demographics of a small community in the U.S. Call the .head() method on the census dataframe and print the output to view the first five rows.

In [5]:
import pandas as pd

# Read in the census dataframe
census = pd.read_csv("../data/census_data.csv", index_col=0)

2. Review the dataframe description and values returned by .head() to assess the variable types of each of the variables. This is an important step to understand what preprocessing will be necessary to work with the data.


In [6]:
print(census.head())

  first_name  last_name birth_year  voted  num_children  income_year  \
0     Denise      Ratke       2005  False             0     92129.41   
1       Hali  Cummerata       1987  False             0     75649.17   
2    Salomon        Orn       1992   True             2    166313.45   
3     Sarina   Schiller       1965  False             2     71704.81   
4       Gust  Abernathy       1945  False             2    143316.08   

       higher_tax marital_status  
0        disagree         single  
1         neutral       divorced  
2           agree         single  
3  strongly agree        married  
4           agree        married  


3. Compare the values returned by the .head() method with the data type of each variable by calling .dtypes on the census dataframe and printing the result.

In [7]:
print(census.dtypes)

first_name         object
last_name          object
birth_year         object
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object
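
The `object` dtype only says that a column holds arbitrary Python objects, so it can help to check which Python types are actually stored in it. The sketch below uses a small stand-in frame with illustrative values (not the real census file) and counts the types found in `birth_year`.

```python
import pandas as pd

# Toy stand-in for the census data (illustrative values only)
toy = pd.DataFrame({
    "birth_year": ["2005", "1987", "missing"],
    "num_children": [0, 0, 2],
})

# Count the Python types stored in the birth_year column
types_in_column = toy["birth_year"].map(type).value_counts()
print(types_in_column)
```

If every entry is a `str`, the column has to be cleaned and converted before numeric summaries such as the mean will work.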


### Inspecting Datatypes 

4. The manager of the census would like to know the average birth year of the respondents. The .dtypes output shows that birth_year has the object dtype, which pandas uses for strings, whereas it should be stored as int.
    Print the unique values of the variable using the .unique() method.

In [8]:
print(census["birth_year"].unique())

['2005' '1987' '1992' '1965' '1945' '1951' '1963' '1949' '1950' '1971'
 '2007' '1944' '1995' '1973' '1946' '1954' '1994' '1989' '1947' '1993'
 '1976' '1984' 'missing' '1966' '1941' '2000' '1953' '1956' '1960' '2001'
 '1980' '1955' '1985' '1996' '1968' '1979' '2006' '1962' '1981' '1959'
 '1977' '1978' '1983' '1957' '1961' '1982' '2002' '1998' '1999' '1952'
 '1940' '1986' '1958']
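
Scanning the unique values by eye works for a small column. As a hedged alternative for longer ones, the sketch below uses `pd.to_numeric` with `errors="coerce"` to isolate the entries that cannot be parsed as numbers. The series here is a toy example with illustrative values.

```python
import pandas as pd

years = pd.Series(["2005", "1987", "missing", "1966"])  # illustrative values

# Entries that cannot be parsed as numbers become NaN with errors="coerce"
parsed = pd.to_numeric(years, errors="coerce")

# Keep only the original strings that failed to parse
non_numeric = years[parsed.isna()].unique().tolist()
print(non_numeric)
```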


### Altering Data

5. There appears to be a missing value in the birth_year column. With some research you find that the respondent’s birth year is 1967.
    Use the .replace() method to replace the missing value with 1967, so that the data type can be changed to int. Then recheck the values in birth_year by calling the .unique() method and printing the results.


In [9]:
census["birth_year"] = census["birth_year"].replace(["missing"], 1967)
print(census["birth_year"].unique())

['2005' '1987' '1992' '1965' '1945' '1951' '1963' '1949' '1950' '1971'
 '2007' '1944' '1995' '1973' '1946' '1954' '1994' '1989' '1947' '1993'
 '1976' '1984' 1967 '1966' '1941' '2000' '1953' '1956' '1960' '2001'
 '1980' '1955' '1985' '1996' '1968' '1979' '2006' '1962' '1981' '1959'
 '1977' '1978' '1983' '1957' '1961' '1982' '2002' '1998' '1999' '1952'
 '1940' '1986' '1958']


6.  Now that we have adjusted the values in the `birth_year` variable, change the datatype from `str` to `int` and print the datatypes of the census dataframe with `.dtypes`.

In [10]:
# use the astype method to switch data types to int
census["birth_year"] = census["birth_year"].astype(int)

# print the data types in the census dataframe
print(census.dtypes)

first_name         object
last_name          object
birth_year          int32
voted                bool
num_children        int64
income_year       float64
higher_tax         object
marital_status     object
dtype: object
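
The `int32` shown above depends on the platform: `astype(int)` uses the default integer width of the environment, which may be `int32` on some setups (for example, Windows with older NumPy versions) and `int64` on others. A minimal sketch, on toy values, of requesting an explicit width so the result is the same everywhere:

```python
import pandas as pd

years = pd.Series(["2005", "missing", "1966"])  # illustrative values

# Replace the placeholder with a string so the column stays one type,
# then convert with an explicit integer width
clean = years.replace("missing", "1967").astype("int64")
print(clean.dtype)
print(clean.tolist())
```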


7.  Having assigned `birth_year` to the appropriate data type, print the average birth year of the respondents to the census using the pandas `.mean()` method.

In [11]:
print(census["birth_year"].mean())

1973.4


8.  Your manager would like to set an order to the higher_tax variable so that: strongly disagree < disagree < neutral < agree < strongly agree.
    Convert the higher_tax variable to the category data type with the appropriate order, then print the new order using the .unique() method.


In [12]:
# Converting the higher_tax column to categorical data
census["higher_tax"] = pd.Categorical(
    census["higher_tax"],
    ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"],
    ordered=True,
)

# print out unique values in the higher_tax column
print(census["higher_tax"].unique())

['disagree', 'neutral', 'agree', 'strongly agree', 'strongly disagree']
Categories (5, object): ['strongly disagree' < 'disagree' < 'neutral' < 'agree' < 'strongly agree']
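
Declaring the order is what makes comparisons between the answers meaningful. The sketch below, built on a few illustrative answers rather than the census file, filters for respondents who at least agree and reads off the lowest and highest answers.

```python
import pandas as pd

levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
answers = pd.Series(
    pd.Categorical(
        ["agree", "neutral", "strongly agree", "disagree"],
        categories=levels,
        ordered=True,
    )
)

# Comparisons against a category label respect the declared order
at_least_agree = answers[answers >= "agree"].tolist()
print(at_least_agree)

# min() and max() are defined only because the categorical is ordered
print(answers.min(), answers.max())
```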


9.  Your manager would also like to know the median sentiment of the respondents on the issue of higher taxes for the wealthy. Label encode the higher_tax variable and print the median using the pandas .median() method.

In [13]:
# Use cat.codes to label encode the higher_tax variable
census["higher_tax"] = census["higher_tax"].cat.codes

# print out the median of the higher_tax variable
print(census["higher_tax"].median())

2.0
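
The codes follow the category order, so 0 is "strongly disagree" and 4 is "strongly agree"; a median of 2.0 therefore corresponds to "neutral". The cell above overwrites the labels with their codes. As a sketch on toy answers, one way to keep the labels is to store the codes separately and map the median back:

```python
import pandas as pd

levels = ["strongly disagree", "disagree", "neutral", "agree", "strongly agree"]
answers = pd.Series(
    pd.Categorical(
        ["agree", "neutral", "strongly agree", "disagree", "neutral"],
        categories=levels,
        ordered=True,
    )
)

# Integer codes follow the category order (a missing value would get -1)
codes = answers.cat.codes

# Lookup table from code back to label
code_to_label = dict(enumerate(answers.cat.categories))

# With an odd number of answers the median is a whole code;
# an even count can give a value halfway between two codes
median_code = int(codes.median())
print(median_code, code_to_label[median_code])
```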



10. Your manager is interested in using machine learning models on the census data in the future. To help, let’s one-hot encode marital_status to create a binary variable for each category. Use the pandas get_dummies() function to one-hot encode the marital_status variable.
    Print the first five rows of the new dataframe with the .head() method. Note that you’ll have to scroll to the right or expand the web-browser to see the dummy variables.

In [15]:
# Use get_dummies to OHE the marital_status
census = pd.get_dummies(census, columns=['marital_status'])
 
# print out the first 5 rows in the census dataframe
print(census.head())



  first_name  last_name  birth_year  voted  num_children  income_year  \
0     Denise      Ratke        2005  False             0     92129.41   
1       Hali  Cummerata        1987  False             0     75649.17   
2    Salomon        Orn        1992   True             2    166313.45   
3     Sarina   Schiller        1965  False             2     71704.81   
4       Gust  Abernathy        1945  False             2    143316.08   

   higher_tax  marital_status_divorced  marital_status_married  \
0           1                    False                   False   
1           2                     True                   False   
2           3                    False                   False   
3           4                    False                    True   
4           3                    False                    True   

   marital_status_single  marital_status_widowed  
0                   True                   False  
1                  False                   False  
2          
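
Passing `columns=['marital_status']` replaces the original column with its dummy columns. If the original labels are still needed afterwards, one option is to encode the column on its own and join the result back. The frame below is a toy stand-in with illustrative rows, and `dtype=int` is an optional choice to get 0/1 instead of True/False.

```python
import pandas as pd

toy = pd.DataFrame({
    "first_name": ["Denise", "Hali", "Sarina"],
    "marital_status": ["single", "divorced", "married"],
})

# Encode only the one column, then join so marital_status is kept
dummies = pd.get_dummies(toy["marital_status"], prefix="marital_status", dtype=int)
encoded = toy.join(dummies)
print(encoded.columns.tolist())
```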


11. There are additional operations you can perform on the data, such as:
    
    Create a new variable called `marital_codes` by Label Encoding the `marital_status` variable. \
    This could help the Census team use machine learning to predict if a respondent thinks the wealthy should pay higher taxes based on their marital status.
    
    Create a new variable called `age_group`, which groups respondents based on their birth year. \
    The groups should be in five-year increments, e.g., 26-30, 31-35, etc. \
    Then label encode the `age_group` variable to assist the Census team in the event they would like to use machine learning to predict if a respondent thinks the wealthy should pay higher taxes based on their age group.


### Marital Codes

In [None]:
marital_codes
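
A possible starting point, not a reference solution: label encoding needs the original `marital_status` labels, which step 10 replaced with dummy columns, so this sketch assumes a frame that still has that column (here a toy one with illustrative values). Because marital status has no natural order, the codes simply follow the alphabetical order of the labels.

```python
import pandas as pd

# Toy frame that still contains the original labels (illustrative values)
toy = pd.DataFrame(
    {"marital_status": ["single", "divorced", "single", "married", "widowed"]}
)

# Unordered category: codes follow the alphabetical order of the labels
toy["marital_codes"] = toy["marital_status"].astype("category").cat.codes
print(toy)
```

The codes are arbitrary identifiers, so a model should not read "widowed" as greater than "divorced"; one-hot encoding avoids that implication.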

### Age Group

In [None]:
age_group = 
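
One way to approach this, sketched under stated assumptions: the source does not give the survey year, so `reference_year` below is a hypothetical value, and the birth years are the five shown by `.head()`. Ages are binned with `pd.cut` into five-year groups and then label encoded with `.cat.codes`.

```python
import pandas as pd

birth_year = pd.Series([2005, 1987, 1992, 1965, 1945])  # first five rows
reference_year = 2023  # assumption: the survey year is not stated

age = reference_year - birth_year

# Edges every five years; intervals are right-closed, so (15, 20] is ages 16-20
edges = list(range(0, 105, 5))
labels = [f"{lo + 1}-{lo + 5}" for lo in edges[:-1]]

# pd.cut returns an ordered categorical, so the codes follow the age order
age_group = pd.cut(age, bins=edges, labels=labels)
age_group_codes = age_group.cat.codes

print(pd.DataFrame({"age": age, "age_group": age_group, "code": age_group_codes}))
```

Ages outside the edges (0 or above 100 here) would fall into no bin and receive the code -1, so the edges may need adjusting for the full dataset.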