# Inspecting a DataFrame

When you get a new DataFrame to work with, the first thing to do is explore it and see what it contains. pandas provides several methods and attributes for this:
- `.head()` returns the first few rows (the “head” of the DataFrame).
- `.info()` shows information on each of the columns, such as the data type and the number of non-missing values.
- `.shape` returns the number of rows and columns of the DataFrame.
- `.describe()` calculates a few summary statistics for each column (by default, only the numeric ones).

In [1]:
# # Print the head of the homelessness data
# print(homelessness.head())

# # Print information about homelessness
# print(homelessness.info())

# # Print the shape of homelessness
# print(homelessness.shape)

# # Print a description of homelessness
# print(homelessness.describe())
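
Since the `homelessness` dataset itself is not included here, the four inspection tools above can be tried on a small stand-in. This is a minimal sketch: the `toy` DataFrame and all of its values are invented for illustration and are not taken from the real data.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "region": ["Pacific", "Mountain", "Pacific", "Mountain"],
        "state": ["State A", "State B", "State C", "State D"],
        "individuals": [1200.0, 300.0, None, 4500.0],  # one missing value
        "family_members": [400.0, 150.0, 90.0, 2100.0],
    }
)

# First two rows instead of the default five
print(toy.head(2))

# .info() prints its report itself and returns None, so print() is not needed
toy.info()

# .shape is an attribute, not a method: a (rows, columns) tuple
print(toy.shape)

# Summary statistics; non-numeric columns are skipped by default
print(toy.describe())

# .info() reports non-null counts; missing values can be counted explicitly
print(toy.isna().sum())
```

Because `.info()` returns `None`, wrapping it in `print()` shows the report followed by a stray `None`.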

# Parts of a DataFrame

A `pandas` DataFrame consists of three components, stored as attributes:

- `.values`: A two-dimensional NumPy array of values.
- `.columns`: An index of columns: the column names.
- `.index`: An index for the rows: either row numbers or row names.

In [2]:
# # Import pandas using the alias pd
# import pandas as pd

# # Print the values of homelessness
# print(homelessness.values)

# # Print the column index of homelessness
# print(homelessness.columns)

# # Print the row index of homelessness
# print(homelessness.index)
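
The three attributes can be seen side by side on a tiny DataFrame. This is a sketch under the assumption of invented data: `toy` and its values are hypothetical, and the row labels are set explicitly so that `.index` shows names rather than numbers.

```python
import pandas as pd

# Hypothetical toy data with row names as the index
toy = pd.DataFrame(
    {"individuals": [1200, 300], "family_members": [400, 150]},
    index=["State A", "State B"],
)

# Two-dimensional NumPy array holding the data
print(toy.values)

# Index object holding the column names
print(toy.columns)

# Index object holding the row labels
print(toy.index)

# The pandas documentation recommends .to_numpy() over .values in new code
print(toy.to_numpy())
```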

# Sorting rows

Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to `.sort_values()`.

In [3]:
# # Sort homelessness by individuals
# homelessness_ind = homelessness.sort_values("individuals")

# # Print the top few rows
# print(homelessness_ind.head())

In [4]:
# # Sort homelessness by descending family members
# homelessness_fam = homelessness.sort_values("family_members", ascending=False)

# # Print the top few rows
# print(homelessness_fam.head())

In [5]:
# # Sort homelessness by region, then descending family members
# homelessness_reg_fam = homelessness.sort_values(["region", "family_members"], ascending=[True, False])

# # Print the top few rows
# print(homelessness_reg_fam.head())
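
The three sorting patterns above (one column, one column descending, several columns with mixed directions) can be checked on a small example. The `toy` DataFrame below is hypothetical and its numbers are made up.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "region": ["Pacific", "Mountain", "Pacific", "Mountain"],
        "state": ["State A", "State B", "State C", "State D"],
        "family_members": [400, 150, 90, 2100],
    }
)

# Ascending is the default
by_fam = toy.sort_values("family_members")

# Largest values first
by_fam_desc = toy.sort_values("family_members", ascending=False)

# Region A-Z, then family_members largest-first within each region
by_reg_fam = toy.sort_values(
    ["region", "family_members"], ascending=[True, False]
)

print(by_reg_fam)
```

`.sort_values()` returns a new DataFrame and leaves `toy` untouched, which is why each result is assigned to its own variable.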

# Subsetting columns

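When working with data, you may not need all of the variables in your dataset. Square brackets (`[]`) can be used to select only the columns that matter to you, in an order that makes sense to you. To select only `"col_a"` of the DataFrame `df`, use `df["col_a"]`. To select several columns, pass a list of names inside the brackets, as in `df[["col_a", "col_b"]]`.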

In [6]:
# # Select the individuals column
# individuals = homelessness["individuals"]

# # Print the head of the result
# print(individuals.head())

In [7]:
# # Select the state and family_members columns
# state_fam = homelessness[["state","family_members"]]

# # Print the head of the result
# print(state_fam.head())

In [8]:
# # Select only the individuals and state columns, in that order
# ind_state = homelessness[["individuals","state"]]

# # Print the head of the result
# print(ind_state.head())
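
One detail worth seeing in code is what type comes back: single brackets with one name give a Series, while double brackets (a list of names) give a DataFrame. A minimal sketch on a hypothetical `toy` DataFrame with invented values:

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "state": ["State A", "State B"],
        "individuals": [1200, 300],
        "family_members": [400, 150],
    }
)

# One column name in single brackets: a Series
individuals = toy["individuals"]
print(type(individuals).__name__)

# A one-element list in brackets: a one-column DataFrame
ind_df = toy[["individuals"]]
print(type(ind_df).__name__)

# The list also controls the column order of the result
ind_state = toy[["individuals", "state"]]
print(ind_state.columns.tolist())
```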

# Subsetting rows

One of the simplest ways to find the interesting parts of a dataset is to select the subset of rows that match some criteria. This is sometimes known as filtering rows or selecting rows.

There are many ways to subset a DataFrame. Perhaps the most common is to use relational operators to produce a True or False value for each row, then pass that result inside square brackets.

In [9]:
# # Filter for rows where individuals is greater than 10000
# ind_gt_10k = homelessness[homelessness["individuals"] > 10000]

# # See the result
# print(ind_gt_10k)

In [10]:
# # Filter for rows where region is Mountain
# mountain_reg = homelessness[homelessness["region"] == "Mountain"]

# # See the result
# print(mountain_reg)

In [11]:
# # Filter for rows where family_members is less than 1000 
# # and region is Pacific
# fam_lt_1k_pac = homelessness[(homelessness["family_members"] < 1000) & (homelessness["region"] == "Pacific")]

# # See the result
# print(fam_lt_1k_pac)
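
It can help to look at the intermediate True/False values before using them as a filter. The sketch below does that on a hypothetical `toy` DataFrame (invented values), using a threshold of 1000 chosen only to suit the toy numbers.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "region": ["Pacific", "Mountain", "Pacific", "Mountain"],
        "state": ["State A", "State B", "State C", "State D"],
        "individuals": [1200, 300, 800, 4500],
        "family_members": [400, 150, 90, 2100],
    }
)

# The comparison yields a boolean Series, one value per row
mask = toy["individuals"] > 1000
print(mask)

# Passing the mask in square brackets keeps only the True rows
ind_gt_1k = toy[mask]
print(ind_gt_1k)

# Combine conditions with & (and); each condition needs its own parentheses
# because & binds more tightly than < and ==
fam_lt_1k_pac = toy[(toy["family_members"] < 1000) & (toy["region"] == "Pacific")]
print(fam_lt_1k_pac)
```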

# Subsetting rows by categorical variables

Subsetting data based on a categorical variable often involves using the "or" operator (`|`) to select rows from multiple categories. This can get tedious when you want all states in one of three different regions, for example. Instead, use the `.isin()` method.

In [12]:
# # Subset for rows in South Atlantic or Mid-Atlantic regions
# south_mid_atlantic = homelessness[homelessness["region"].isin(["South Atlantic","Mid-Atlantic"])]

# # See the result
# print(south_mid_atlantic)

In [13]:
# # The Mojave Desert states
# canu = ["California", "Arizona", "Nevada", "Utah"]

# # Filter for rows in the Mojave Desert states
# mojave_homelessness = homelessness[homelessness["state"].isin(canu)]

# # See the result
# print(mojave_homelessness)
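
To make the comparison concrete, the sketch below writes the same filter both ways, with chained `|` conditions and with `.isin()`, and confirms they select the same rows. The `toy` DataFrame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "region": ["Pacific", "Mountain", "New England", "Pacific"],
        "state": ["State A", "State B", "State C", "State D"],
    }
)

regions = ["Pacific", "Mountain"]

# Chained "or" conditions: one comparison per category
with_or = toy[(toy["region"] == "Pacific") | (toy["region"] == "Mountain")]

# .isin(): one call, however many categories are in the list
with_isin = toy[toy["region"].isin(regions)]

# Both approaches select exactly the same rows
print(with_or.equals(with_isin))

# ~ negates the mask: rows whose region is NOT in the list
not_in = toy[~toy["region"].isin(regions)]
print(not_in)
```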

# Adding new columns

You can add new columns to a DataFrame. This has many names, such as transforming, mutating, and feature engineering.

In [14]:
# # Add total col as sum of individuals and family_members
# homelessness["total"] = homelessness["individuals"] + homelessness["family_members"]

# # Add p_individuals col as proportion of individuals
# homelessness["p_individuals"] = homelessness["individuals"] / homelessness["total"]

# # See the result
# print(homelessness)
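
Assigning to a new column name modifies the DataFrame in place. The sketch below shows that on a hypothetical `toy` DataFrame (invented values), and also shows `.assign()`, an alternative not used in the cell above, which returns a new DataFrame and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
base = pd.DataFrame(
    {
        "state": ["State A", "State B"],
        "individuals": [1200, 300],
        "family_members": [400, 100],
    }
)

# In-place approach: work on a copy so `base` stays available for comparison
toy = base.copy()
toy["total"] = toy["individuals"] + toy["family_members"]
toy["p_individuals"] = toy["individuals"] / toy["total"]
print(toy)

# .assign() approach: returns a new DataFrame, `base` is not modified
with_total = base.assign(total=lambda d: d["individuals"] + d["family_members"])
print(with_total)
print(base.columns.tolist())
```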

# Combo-attack!

Which state has the highest number of homeless individuals per 10,000 people in the state?

In [15]:
# # Create indiv_per_10k col as homeless individuals per 10k state pop
# homelessness["indiv_per_10k"] = 10000 * homelessness["individuals"] / homelessness["state_pop"] 

# # Subset rows for indiv_per_10k greater than 20
# high_homelessness = homelessness[homelessness["indiv_per_10k"] > 20]

# # Sort high_homelessness by descending indiv_per_10k
# high_homelessness_srt = high_homelessness.sort_values("indiv_per_10k", ascending=False)

# # From high_homelessness_srt, select the state and indiv_per_10k cols
# result = high_homelessness_srt[["state","indiv_per_10k"]]

# # See the result
# print(result)
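
The same four steps (add a column, filter, sort, select columns) can also be written as a single method chain. This is a sketch on a hypothetical `toy` DataFrame with invented numbers; the threshold of 10 is chosen only to suit the toy values and is not the 20 used above.

```python
import pandas as pd

# Hypothetical toy data; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "state": ["State A", "State B", "State C"],
        "individuals": [1200, 300, 4500],
        "state_pop": [1_000_000, 500_000, 1_500_000],
    }
)

result = (
    # Add the rate column without modifying `toy`
    toy.assign(indiv_per_10k=10000 * toy["individuals"] / toy["state_pop"])
    # Keep rows above the (toy) threshold
    .loc[lambda d: d["indiv_per_10k"] > 10]
    # Highest rate first, then keep only the two columns of interest
    .sort_values("indiv_per_10k", ascending=False)[["state", "indiv_per_10k"]]
)
print(result)

# After sorting in descending order, the first row answers the question
top_state = result.iloc[0]["state"]
print(top_state)
```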