## Preparing the dataset

In [1]:
import pandas as pd

data = {
    "region": [
        "East South Central", "Pacific", "Mountain", "West South Central", "Pacific", "Mountain", "New England",
        "South Atlantic", "South Atlantic", "South Atlantic", "South Atlantic", "Pacific", "Mountain",
        "East North Central", "East North Central", "West North Central", "West North Central", "East South Central",
        "West South Central", "New England", "South Atlantic", "New England", "East North Central",
        "West North Central", "East South Central", "West North Central", "Mountain", "West North Central",
        "Mountain", "New England", "Mid-Atlantic", "Mountain", "Mid-Atlantic", "South Atlantic",
        "West North Central", "East North Central", "West South Central", "Pacific", "Mid-Atlantic",
        "New England", "South Atlantic", "West North Central", "East South Central", "West South Central",
        "Mountain", "New England", "South Atlantic", "Pacific", "South Atlantic", "East North Central", "Mountain"
    ],
    "state": [
        "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut",
        "Delaware", "District of Columbia", "Florida", "Georgia", "Hawaii", "Idaho",
        "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky",
        "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan",
        "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska",
        "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York",
        "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
        "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee",
        "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"
    ],
    "individuals": [
        2570.0, 1434.0, 7259.0, 2280.0, 109008.0, 7607.0, 2280.0,
        708.0, 3770.0, 21443.0, 6943.0, 4131.0, 1297.0,
        6752.0, 3776.0, 1711.0, 1443.0, 2735.0,
        2540.0, 1450.0, 4914.0, 6811.0, 5209.0,
        3993.0, 1024.0, 3776.0, 983.0, 1745.0,
        7058.0, 835.0, 6048.0, 1949.0, 39827.0,
        6451.0, 467.0, 6929.0, 2823.0, 11139.0,
        8163.0, 747.0, 3082.0, 836.0, 6139.0,
        19199.0, 1904.0, 780.0, 3928.0, 16424.0, 1021.0, 2740.0, 434.0
    ],
    "family_members": [
        864.0, 582.0, 2606.0, 432.0, 20964.0, 3250.0, 1696.0,
        374.0, 3134.0, 9587.0, 2556.0, 2399.0, 715.0,
        3891.0, 1482.0, 1038.0, 773.0, 953.0,
        519.0, 1066.0, 2230.0, 13257.0, 3142.0,
        3250.0, 328.0, 2107.0, 422.0, 676.0,
        486.0, 615.0, 3350.0, 602.0, 52070.0,
        2817.0, 75.0, 3320.0, 1048.0, 3337.0,
        5349.0, 354.0, 851.0, 323.0, 1744.0,
        6111.0, 972.0, 511.0, 2047.0, 5880.0, 222.0, 2167.0, 205.0
    ],
    "state_pop": [
        4887681, 735139, 7158024, 3009733, 39461588, 5691287, 3571520,
        965479, 701547, 21244317, 10511131, 1420593, 1750536,
        12723071, 6695497, 3148618, 2911359, 4461153,
        4659690, 1339057, 6035802, 6882635, 9984072,
        5606249, 2981020, 6121623, 1060665, 1925614,
        3027341, 1353465, 8886025, 2092741, 19530351,
        10381615, 758080, 11676341, 3940235, 4181886,
        12800922, 1058287, 5084156, 878698, 6771631,
        28628666, 3153550, 624358, 8501286, 7523869, 1804291, 5807406, 577601
    ]
}

homelessness = pd.DataFrame(data)

## Exercise: Sorting rows

When working with a DataFrame, reorganizing the rows can help you discover patterns or highlight specific data points. The `.sort_values()` method lets you sort the DataFrame based on one or more columns.

### Syntax Overview
- **Sort by a single column**:
  ```python
  df.sort_values("column_name")
  ```

- **Sort by multiple columns** (helps break ties):

  ```python
  df.sort_values(["col1", "col2"])
  ```

By combining `.sort_values()` with `.head()`, you can answer questions like “What are the top entries where...?”
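
The sort direction is controlled by the `ascending` argument, which defaults to `True`. A minimal sketch follows, using a small `dogs` DataFrame whose names and values are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle"],
    "height_cm": [56, 62, 59, 46],
})

# Descending order on a single column
tallest_first = dogs.sort_values("height_cm", ascending=False)

# One direction per column: breed A-Z, then tallest first within each breed
by_breed_height = dogs.sort_values(
    ["breed", "height_cm"], ascending=[True, False]
)

print(tallest_first.head(2))
print(by_breed_height)
```

Sorting returns a new DataFrame and keeps the original index labels, which is why the row labels in sorted output appear out of order.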

### Instructions:

1. **Sort by Individual Homeless Counts (Ascending)**
   Arrange the `homelessness` DataFrame based on the `individuals` column from lowest to highest.
   Store the result in a new variable called `homelessness_ind` and display the first few rows.

2. **Sort by Family Member Counts (Descending)**
   Reorder the DataFrame by the `family_members` column in descending order.
   Save the result as `homelessness_fam`.

3. **Sort by Region and Family Members**
   First sort by `region` (A–Z), and then within each region, sort by `family_members` in descending order.
   Store the final DataFrame in `homelessness_reg_fam`.

In [37]:
# 1. Sort by 'individuals' in ascending order
homelessness_ind = homelessness.sort_values('individuals')

# Print the top few rows
print(homelessness_ind.head())

                region         state  individuals  family_members  state_pop
50            Mountain       Wyoming        434.0           205.0     577601
34  West North Central  North Dakota        467.0            75.0     758080
7       South Atlantic      Delaware        708.0           374.0     965479
39         New England  Rhode Island        747.0           354.0    1058287
45         New England       Vermont        780.0           511.0     624358


In [36]:
# 2. Sort by 'family_members' in descending order
homelessness_fam = homelessness.sort_values('family_members', ascending=False)

print(homelessness_fam.head())

                region          state  individuals  family_members  state_pop
32        Mid-Atlantic       New York      39827.0         52070.0   19530351
4              Pacific     California     109008.0         20964.0   39461588
21         New England  Massachusetts       6811.0         13257.0    6882635
9       South Atlantic        Florida      21443.0          9587.0   21244317
43  West South Central          Texas      19199.0          6111.0   28628666


In [35]:
# 3. Sort by 'region' (asc), then by 'family_members' (desc)
homelessness_reg_fam = homelessness.sort_values(['region', 'family_members'], ascending=[True, False])

# Print the top few rows
print(homelessness_reg_fam.head())

                region      state  individuals  family_members  state_pop
13  East North Central   Illinois       6752.0          3891.0   12723071
35  East North Central       Ohio       6929.0          3320.0   11676341
22  East North Central   Michigan       5209.0          3142.0    9984072
49  East North Central  Wisconsin       2740.0          2167.0    5807406
14  East North Central    Indiana       3776.0          1482.0    6695497


## Exercise: Subsetting columns

Often, you won’t need all the columns from a dataset. Instead, you can focus on just the ones that are relevant to your analysis. In pandas, square brackets `[]` let you extract specific columns from a DataFrame.

### Column Selection Syntax
- **Select a single column** (returns a Series):
  ```python
  df["col_a"]
  ```

- **Select multiple columns** (returns a DataFrame):

  ```python
  df[["col_a", "col_b"]]
  ```
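
The difference between the two forms can be checked with `type()`. A short sketch follows, on a hypothetical `dogs` DataFrame with invented values:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy"],
    "color": ["brown", "black", "tan"],
    "height_cm": [56, 62, 59],
})

heights = dogs["height_cm"]                 # single brackets -> Series
name_height = dogs[["name", "height_cm"]]   # list of names -> DataFrame
one_col_df = dogs[["height_cm"]]            # one-item list is still a DataFrame

print(type(heights).__name__)
print(type(name_height).__name__)
print(one_col_df.shape)
```

The inner brackets are an ordinary Python list, so the order of names in that list determines the column order of the result.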

### Instructions:

1. **Select One Column as a Series**
   Create a variable called `individuals` that stores just the `individuals` column from `homelessness`.

2. **Select Two Columns as a DataFrame**
   Extract the `state` and `family_members` columns (in that order), and save the result in a DataFrame named `state_fam`.

3. **Reorder and Select Columns**
   Create a new DataFrame called `ind_state` that includes the `individuals` and `state` columns from `homelessness`, in that order.

In [34]:
# 1. Select the 'individuals' column
individuals = homelessness['individuals']

print(individuals.head())

0      2570.0
1      1434.0
2      7259.0
3      2280.0
4    109008.0
Name: individuals, dtype: float64


In [33]:
# 2. Select 'state' and 'family_members' columns
state_fam = homelessness[['state', 'family_members']]

print(state_fam.head())

        state  family_members
0     Alabama           864.0
1      Alaska           582.0
2     Arizona          2606.0
3    Arkansas           432.0
4  California         20964.0


In [32]:
# 3. Select 'individuals' and 'state' columns in custom order
ind_state = homelessness[['individuals', 'state']]

print(ind_state.head())

   individuals       state
0       2570.0     Alabama
1       1434.0      Alaska
2       7259.0     Arizona
3       2280.0    Arkansas
4     109008.0  California


## Exercise: Subsetting rows

One of the most powerful tools in data analysis is identifying specific rows that match certain criteria. This process is often called *filtering* or *subsetting* rows.

### Basic Row Filtering

You can use **relational operators** (like `>`, `<`, `==`) to filter rows. Applied to a column, such an expression returns a Boolean Series with one `True` or `False` per row; passing that Series inside square brackets keeps only the rows where it is `True`.

Examples:
```python
# Dogs taller than 60 cm
dogs[dogs["height_cm"] > 60]

# Dogs that are tan in color
dogs[dogs["color"] == "tan"]
```
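
To see what the comparison produces before it is used for subsetting, the Boolean Series can be stored and inspected on its own. A minimal sketch follows, with a hypothetical `dogs` DataFrame (values are invented):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "color": ["brown", "black", "tan", "tan"],
    "height_cm": [56, 62, 59, 46],
})

# The comparison is evaluated row by row and yields a Boolean Series
is_tall = dogs["height_cm"] > 60
print(is_tall)

# Using the mask inside [] keeps only the rows marked True
tall_dogs = dogs[is_tall]
print(tall_dogs)

# True counts as 1, so summing the mask counts the matching rows
n_tall = int(is_tall.sum())
print(n_tall)
```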

### Combining Conditions

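To filter using more than one condition, combine the conditions with the **bitwise AND** operator `&`, wrapping each condition in its own parentheses. The parentheses are required because `&` binds more tightly than comparison operators such as `>` and `==`: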

```python
# Tall, tan dogs
dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]
```
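
The same pattern extends to the bitwise OR operator `|` and to negation with `~`. The sketch below uses a hypothetical `dogs` DataFrame with made-up values and stores each condition in a variable first, which keeps the combined expression readable:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "color": ["brown", "black", "tan", "tan"],
    "height_cm": [56, 62, 65, 46],
})

tall = dogs["height_cm"] > 60
tan = dogs["color"] == "tan"

tall_and_tan = dogs[tall & tan]   # both conditions must hold
tall_or_tan = dogs[tall | tan]    # at least one condition holds
not_tan = dogs[~tan]              # rows where the condition is False

print(tall_and_tan)
print(tall_or_tan)
print(not_tan)
```

Python's `and`, `or`, and `not` keywords do not work element-wise on a Series, which is why the bitwise operators are used instead.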

### Instructions:

1. **Filter by Number of Individuals**
   Select rows where the `individuals` column is greater than 10,000.
   Save this subset as `ind_gt_10k`.

2. **Filter by Region: Mountain**
   Select rows where the `region` column equals `"Mountain"`.
   Save this as `mountain_reg`.

3. **Filter by Multiple Conditions**
   Select rows where `family_members` is less than 1,000 **and** the `region` is `"Pacific"`.
   Store the result in `fam_lt_1k_pac`.

In [31]:
# 1. Individuals greater than 10,000
ind_gt_10k = homelessness[homelessness['individuals'] > 10000]

# View the result
print(ind_gt_10k)

                region       state  individuals  family_members  state_pop
4              Pacific  California     109008.0         20964.0   39461588
9       South Atlantic     Florida      21443.0          9587.0   21244317
32        Mid-Atlantic    New York      39827.0         52070.0   19530351
37             Pacific      Oregon      11139.0          3337.0    4181886
43  West South Central       Texas      19199.0          6111.0   28628666
47             Pacific  Washington      16424.0          5880.0    7523869


In [30]:
# 2. Region is "Mountain"
mountain_reg = homelessness[homelessness['region'] == 'Mountain']

# View the result
print(mountain_reg)

      region       state  individuals  family_members  state_pop
2   Mountain     Arizona       7259.0          2606.0    7158024
5   Mountain    Colorado       7607.0          3250.0    5691287
12  Mountain       Idaho       1297.0           715.0    1750536
26  Mountain     Montana        983.0           422.0    1060665
28  Mountain      Nevada       7058.0           486.0    3027341
31  Mountain  New Mexico       1949.0           602.0    2092741
44  Mountain        Utah       1904.0           972.0    3153550
50  Mountain     Wyoming        434.0           205.0     577601


In [28]:
# 3. Family members < 1,000 AND region is "Pacific"
fam_lt_1k_pac = homelessness[(homelessness["family_members"] < 1000) & (homelessness["region"] == "Pacific")]

# View the result
print(fam_lt_1k_pac)

    region   state  individuals  family_members  state_pop
1  Pacific  Alaska       1434.0           582.0     735139


## Exercise: Filtering rows using categorical values

When working with categorical data, you may want to filter your DataFrame for **multiple category values**. While you could chain several equality checks with the bitwise OR operator `|`, that quickly becomes messy and repetitive.

### A Better Approach: `.isin()`

The `.isin()` method provides a cleaner way to do this: it checks, for each row, whether the value is in a list of specified categories and returns a Boolean Series.

Example:
```python
# Define a list of target colors
colors = ["brown", "black", "tan"]

# Create a condition that checks if each row's color is in the list
condition = dogs["color"].isin(colors)

# Subset the DataFrame based on the condition
filtered_dogs = dogs[condition]
```

This is especially useful when subsetting for multiple values of a categorical column like `region`, `state`, or `breed`.
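
As a quick check, the sketch below compares `.isin()` with the equivalent chain of `|` conditions and shows how `~` selects the rows that are *not* in the list. The `dogs` data is hypothetical and exists only for illustration:

```python
import pandas as pd

# Hypothetical toy data, for illustration only
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper", "Max"],
    "color": ["brown", "black", "tan", "white", "gray"],
})

colors = ["brown", "black", "tan"]

# One call to .isin() ...
with_isin = dogs[dogs["color"].isin(colors)]

# ... selects the same rows as three chained equality checks
with_or = dogs[
    (dogs["color"] == "brown")
    | (dogs["color"] == "black")
    | (dogs["color"] == "tan")
]

# Negating the mask keeps every row whose color is not in the list
other_colors = dogs[~dogs["color"].isin(colors)]

print(with_isin.equals(with_or))
print(other_colors)
```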

### Instructions:

The `homelessness` DataFrame is available, and `pandas` is already imported as `pd`.

1. Filter `homelessness` to include only rows where the `state` is in a given list of Mojave-region states:

   ```python
   canu = ["California", "Arizona", "Nevada", "Utah"]
   ```
2. Use `.isin()` to perform the filter in a clean and readable way.
3. Assign the filtered result to `mojave_homelessness` and print it.

In [26]:
# Define the list of Mojave-region states
canu = ["California", "Arizona", "Nevada", "Utah"]

# Filter using .isin()
mojave_homelessness = homelessness[homelessness["state"].isin(canu)]

# View the result
print(mojave_homelessness)

      region       state  individuals  family_members  state_pop
2   Mountain     Arizona       7259.0          2606.0    7158024
4    Pacific  California     109008.0         20964.0   39461588
28  Mountain      Nevada       7058.0           486.0    3027341
44  Mountain        Utah       1904.0           972.0    3153550
