## Preparing the Dataset

In [1]:
import pandas as pd

data = {
    "region": [
        "East South Central", "Pacific", "Mountain", "West South Central", "Pacific", "Mountain", "New England",
        "South Atlantic", "South Atlantic", "South Atlantic", "South Atlantic", "Pacific", "Mountain",
        "East North Central", "East North Central", "West North Central", "West North Central", "East South Central",
        "West South Central", "New England", "South Atlantic", "New England", "East North Central",
        "West North Central", "East South Central", "West North Central", "Mountain", "West North Central",
        "Mountain", "New England", "Mid-Atlantic", "Mountain", "Mid-Atlantic", "South Atlantic",
        "West North Central", "East North Central", "West South Central", "Pacific", "Mid-Atlantic",
        "New England", "South Atlantic", "West North Central", "East South Central", "West South Central",
        "Mountain", "New England", "South Atlantic", "Pacific", "South Atlantic", "East North Central", "Mountain"
    ],
    "state": [
        "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut",
        "Delaware", "District of Columbia", "Florida", "Georgia", "Hawaii", "Idaho",
        "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky",
        "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan",
        "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska",
        "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York",
        "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
        "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee",
        "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"
    ],
    "individuals": [
        2570.0, 1434.0, 7259.0, 2280.0, 109008.0, 7607.0, 2280.0,
        708.0, 3770.0, 21443.0, 6943.0, 4131.0, 1297.0,
        6752.0, 3776.0, 1711.0, 1443.0, 2735.0,
        2540.0, 1450.0, 4914.0, 6811.0, 5209.0,
        3993.0, 1024.0, 3776.0, 983.0, 1745.0,
        7058.0, 835.0, 6048.0, 1949.0, 39827.0,
        6451.0, 467.0, 6929.0, 2823.0, 11139.0,
        8163.0, 747.0, 3082.0, 836.0, 6139.0,
        19199.0, 1904.0, 780.0, 3928.0, 16424.0, 1021.0, 2740.0, 434.0
    ],
    "family_members": [
        864.0, 582.0, 2606.0, 432.0, 20964.0, 3250.0, 1696.0,
        374.0, 3134.0, 9587.0, 2556.0, 2399.0, 715.0,
        3891.0, 1482.0, 1038.0, 773.0, 953.0,
        519.0, 1066.0, 2230.0, 13257.0, 3142.0,
        3250.0, 328.0, 2107.0, 422.0, 676.0,
        486.0, 615.0, 3350.0, 602.0, 52070.0,
        2817.0, 75.0, 3320.0, 1048.0, 3337.0,
        5349.0, 354.0, 851.0, 323.0, 1744.0,
        6111.0, 972.0, 511.0, 2047.0, 5880.0, 222.0, 2167.0, 205.0
    ],
    "state_pop": [
        4887681, 735139, 7158024, 3009733, 39461588, 5691287, 3571520,
        965479, 701547, 21244317, 10511131, 1420593, 1750536,
        12723071, 6695497, 3148618, 2911359, 4461153,
        4659690, 1339057, 6035802, 6882635, 9984072,
        5606249, 2981020, 6121623, 1060665, 1925614,
        3027341, 1353465, 8886025, 2092741, 19530351,
        10381615, 758080, 11676341, 3940235, 4181886,
        12800922, 1058287, 5084156, 878698, 6771631,
        28628666, 3153550, 624358, 8501286, 7523869, 1804291, 5807406, 577601
    ]
}

homelessness = pd.DataFrame(data)

## Exercise: Adding new columns

When working with data, you’ll often need to **generate new columns** from existing ones. This process is commonly referred to as **feature engineering** or **data transformation**.

You can:
- Add new columns from scratch.
- Derive columns using arithmetic or logical operations on existing data.
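
As a minimal sketch of both approaches, the snippet below uses three rows copied from the data above. The names `sample`, `year`, `total`, and `over_5000` are illustrative only and are not part of the exercise.

```python
import pandas as pd

# Three rows copied from the homelessness data, to keep the sketch small
sample = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona"],
    "individuals": [2570.0, 1434.0, 7259.0],
    "family_members": [864.0, 582.0, 2606.0],
})

# A column from scratch: a scalar is broadcast to every row
sample["year"] = 2018

# A derived column using arithmetic on existing columns
sample["total"] = sample["individuals"] + sample["family_members"]

# A derived column using a logical comparison (yields booleans)
sample["over_5000"] = sample["total"] > 5000

print(sample)
```

Assigning to `sample["new_name"]` modifies the DataFrame in place, which is the same pattern the exercises below rely on.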


### About the Dataset

The `homelessness` DataFrame contains estimates of homelessness for each U.S. state and the District of Columbia in 2018:

- `individuals`: Number of homeless individuals *not* in a family with children.
- `family_members`: Number of homeless individuals *in* families with children.
- `state_pop`: Total population of the state.

`pandas` is already imported as `pd`.

### Instructions:

1. Create a new column named **`total_homeless`** that represents the **sum** of `individuals` and `family_members`.
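2. Add another column called **`percent_homeless`** that holds the **proportion** of homeless people in the state’s total population (`total_homeless` divided by `state_pop`). Despite the name, the value is a fraction between 0 and 1, not a percentage.
3. Print the first few rows of the updated DataFrame.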

In [2]:
# Compute the total number of homeless people (individuals plus family members)
homelessness["total_homeless"] = homelessness["individuals"] + homelessness["family_members"]

# Calculate the proportion of homeless people in the state's total population
homelessness["percent_homeless"] = homelessness["total_homeless"] / homelessness["state_pop"]

# Display the first five rows of the modified DataFrame
print(homelessness.head())

               region       state  individuals  family_members  state_pop  \
0  East South Central     Alabama       2570.0           864.0    4887681   
1             Pacific      Alaska       1434.0           582.0     735139   
2            Mountain     Arizona       7259.0          2606.0    7158024   
3  West South Central    Arkansas       2280.0           432.0    3009733   
4             Pacific  California     109008.0         20964.0   39461588   

   total_homeless  percent_homeless  
0          3434.0          0.000703  
1          2016.0          0.002742  
2          9865.0          0.001378  
3          2712.0          0.000901  
4        129972.0          0.003294  
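
If you would rather leave the original DataFrame untouched, `DataFrame.assign` returns a new DataFrame with the extra columns. The sketch below is one possible variation, not part of the exercise: `sample`, `with_totals`, and `pct_homeless` are hypothetical names, and `pct_homeless` multiplies by 100 so that the value is a true percentage.

```python
import pandas as pd

# Two rows copied from the homelessness data
sample = pd.DataFrame({
    "state": ["Alabama", "Alaska"],
    "individuals": [2570.0, 1434.0],
    "family_members": [864.0, 582.0],
    "state_pop": [4887681, 735139],
})

# assign() returns a copy; later keyword arguments can use earlier ones
with_totals = sample.assign(
    total_homeless=lambda df: df["individuals"] + df["family_members"],
    pct_homeless=lambda df: 100 * df["total_homeless"] / df["state_pop"],
)

print(with_totals)
print(list(sample.columns))  # the original columns are unchanged
```

Passing callables (the `lambda df: ...` arguments) lets the second column refer to `total_homeless`, which only exists on the intermediate result.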


## Exercise: Combo-attack!

You've now practiced the four most essential data wrangling techniques:
- Sorting rows
- Selecting columns
- Filtering rows
- Creating new columns

Now combine all four to answer a concrete analytical question:

### Question
**Which U.S. states have more than 20 homeless individuals per 10,000 residents?**

Let’s find out by putting together multiple pandas operations in sequence.

### Instructions:

1. **Create a new column** `indiv_per_10k` to calculate the number of homeless individuals per 10,000 people in each state.
2. **Filter the DataFrame** to include only rows where `indiv_per_10k` is greater than 20.
3. **Sort the filtered DataFrame** in descending order based on `indiv_per_10k`.
4. **Select only** the `state` and `indiv_per_10k` columns for the final output.
5. **Print the result**.
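
To make the rate in step 1 concrete, here is the arithmetic for a single row, using the District of Columbia figures from the dataset. The variable names are illustrative.

```python
# District of Columbia values taken from the dataset above
individuals = 3770.0
state_pop = 701547

# Scale the per-person rate up to "per 10,000 residents"
rate_per_10k = 10000 * individuals / state_pop

print(round(rate_per_10k, 2))
```

The column version in the exercise applies the same formula to every row at once.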

In [3]:
# Step 1: Add column for individuals per 10k people
homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"]

# Step 2: Filter states where the rate exceeds 20
high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]

# Step 3: Sort the filtered data in descending order
high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending=False)

# Step 4: Select only relevant columns
result = high_homelessness_srt[["state", "indiv_per_10k"]]

# Step 5: Display the outcome
print(result)

                   state  indiv_per_10k
8   District of Columbia      53.738381
11                Hawaii      29.079406
4             California      27.623825
37                Oregon      26.636307
28                Nevada      23.314189
47            Washington      21.829195
32              New York      20.392363
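
The same four steps can also be written as a single method chain, which avoids the intermediate variables. This is a sketch of an alternative style rather than the exercise's expected solution; it uses four rows copied from the dataset, and `sample` is a hypothetical name.

```python
import pandas as pd

# Four rows copied from the homelessness data
sample = pd.DataFrame({
    "state": ["Alabama", "California", "District of Columbia", "Hawaii"],
    "individuals": [2570.0, 109008.0, 3770.0, 4131.0],
    "state_pop": [4887681, 39461588, 701547, 1420593],
})

result = (
    sample
    # Create the new column without modifying `sample`
    .assign(indiv_per_10k=lambda df: 10000 * df["individuals"] / df["state_pop"])
    # Filter rows where the rate exceeds 20
    .query("indiv_per_10k > 20")
    # Sort from highest to lowest rate
    .sort_values("indiv_per_10k", ascending=False)
    # Select only the columns of interest
    .loc[:, ["state", "indiv_per_10k"]]
)

print(result)
```

Chaining reads top to bottom in the same order as the instructions, while the step-by-step version above is easier to inspect when debugging.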
