# Inspecting a DataFrame

When you get a new DataFrame to work with, the first thing you need to do is explore it and see what it contains. There are several useful methods and attributes for this.

* .head() returns the first few rows (the “head” of the DataFrame).
* .info() shows information on each of the columns, such as the data type and number of missing values.
* .shape returns the number of rows and columns of the DataFrame.
* .describe() calculates a few summary statistics for each column.

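The exercises describe homelessness as a DataFrame of estimates of homelessness in each U.S. state in 2018, with an individuals column (homeless individuals who are not part of a family with children), a family_members column (homeless individuals who are part of a family with children), and a state_pop column (the state's total population). The CSV loaded in this notebook is laid out differently: as the output below shows, it covers 2007 to 2016 and has the columns Year, State, CoC Number, CoC Name, Measures, and Count, so the code cells substitute these columns for the ones named in the instructions.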

pandas is imported for you.

In [1]:
import pandas as pd

homelessness = pd.read_csv('/kaggle/input/homelessness/2007-2016-Homelessnewss-USA.csv')


In [2]:
# Print the head of the homelessness data
print(homelessness.head())

       Year State CoC Number       CoC Name  \
0  1/1/2007    AK     AK-500  Anchorage CoC   
1  1/1/2007    AK     AK-500  Anchorage CoC   
2  1/1/2007    AK     AK-500  Anchorage CoC   
3  1/1/2007    AK     AK-500  Anchorage CoC   
4  1/1/2007    AK     AK-500  Anchorage CoC   

                                     Measures Count  
0            Chronically Homeless Individuals   224  
1                        Homeless Individuals   696  
2                 Homeless People in Families   278  
3  Sheltered Chronically Homeless Individuals   187  
4                          Sheltered Homeless   842  


In [3]:
# Print information about homelessness
# (.info() prints its report itself and returns None, so it is not wrapped in print())
homelessness.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 86529 entries, 0 to 86528
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Year        86529 non-null  object
 1   State       86529 non-null  object
 2   CoC Number  86529 non-null  object
 3   CoC Name    86529 non-null  object
 4   Measures    86529 non-null  object
 5   Count       86529 non-null  object
dtypes: object(6)
memory usage: 4.0+ MB


In [4]:
# Print the shape of homelessness
print(homelessness.shape)


(86529, 6)


In [5]:
# Print a description of homelessness
print(homelessness.describe())

            Year  State CoC Number       CoC Name  \
count      86529  86529      86529          86529   
unique        10     54        414            414   
top     1/1/2015     CA     AK-500  Anchorage CoC   
freq       16926   8946        216            216   

                                Measures  Count  
count                              86529  86529  
unique                                42   3608  
top     Chronically Homeless Individuals      0  
freq                                3999  12209  
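
Every column in this CSV was read as object (string) dtype, which is why .describe() reports count, unique, top, and freq rather than means and quartiles. The sketch below uses a small made-up DataFrame (the toy values are assumptions, not rows from the dataset) to show how the summary changes once a column is numeric.

```python
import pandas as pd

# Toy data (hypothetical values); both columns start out as strings
toy = pd.DataFrame({
    "State": ["AK", "AK", "WY", "WY"],
    "Count": ["224", "696", "38", "331"],
})

# With only string columns, describe() gives count / unique / top / freq
object_summary = toy.describe()
print(object_summary)

# After converting Count to numbers, describe() summarizes the numeric column
toy["Count"] = pd.to_numeric(toy["Count"])
numeric_summary = toy.describe()
print(numeric_summary)

# include="all" reports both kinds of column in one table
print(toy.describe(include="all"))
```

Converting Count before calling .describe() on homelessness should give a numeric summary in the same way, assuming the column's values parse as numbers.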


# **Parts of a DataFrame**

To better understand DataFrame objects, it's useful to know that they consist of three components, stored as attributes:

* .values: a two-dimensional NumPy array of values.
* .columns: an index of columns, that is, the column names.
* .index: an index for the rows, either row numbers or row names.

You can usually think of indexes as a list of strings or numbers, though the pandas Index data type allows for more sophisticated options. (These will be covered later in the course.)
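
As a rough illustration of the three components, the sketch below builds a tiny DataFrame with row names instead of row numbers. The dogs data and its values are hypothetical and are not part of the homelessness dataset.

```python
import pandas as pd

# Toy data (hypothetical values) with row names instead of row numbers
dogs = pd.DataFrame(
    {"breed": ["Labrador", "Poodle"], "height_cm": [56, 43]},
    index=["Bella", "Charlie"],
)

values = dogs.values            # 2D NumPy array of the cell values
col_names = list(dogs.columns)  # column labels
row_names = list(dogs.index)    # row labels

print(values.shape)
print(col_names)
print(row_names)

# to_numpy() returns the same data; the pandas documentation recommends it over .values
same_values = dogs.to_numpy()
```

Wrapping .columns and .index in list() is only done here to show that they behave like sequences of labels; in normal use they can be left as Index objects.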



* Import pandas using the alias pd.
* Print a 2D NumPy array of the values in homelessness.
* Print the column names of homelessness.
* Print the index of homelessness.

In [6]:
# Import pandas using the alias pd
import pandas as pd

# Print the values of homelessness
print(homelessness.values)



[['1/1/2007' 'AK' 'AK-500' 'Anchorage CoC'
  'Chronically Homeless Individuals' '224']
 ['1/1/2007' 'AK' 'AK-500' 'Anchorage CoC' 'Homeless Individuals' '696']
 ['1/1/2007' 'AK' 'AK-500' 'Anchorage CoC' 'Homeless People in Families'
  '278']
 ...
 ['1/1/2016' 'WY' 'WY-500' 'Wyoming Statewide CoC'
  'Unsheltered Parenting Youth (Under 25)' '3']
 ['1/1/2016' 'WY' 'WY-500' 'Wyoming Statewide CoC'
  'Unsheltered Parenting Youth Age 18-24' '3']
 ['1/1/2016' 'WY' 'WY-500' 'Wyoming Statewide CoC'
  'Unsheltered Parenting Youth Under 18' '0']]


In [7]:
# Print the column index of homelessness
print(homelessness.columns)

Index(['Year', 'State', 'CoC Number', 'CoC Name', 'Measures', 'Count'], dtype='object')


In [8]:

# Print the row index of homelessness
print(homelessness.index)

RangeIndex(start=0, stop=86529, step=1)


# **Sorting rows**

Finding interesting bits of data in a DataFrame is often easier if you change the order of the rows. You can sort the rows by passing a column name to .sort_values().

In cases where rows have the same value (this is common if you sort on a categorical variable), you may wish to break the ties by sorting on another column. You can sort on multiple columns in this way by passing a list of column names.

| Sort on … | Syntax |
|---|---|
| one column | df.sort_values("breed") |
| multiple columns | df.sort_values(["breed", "weight_kg"]) |

By combining .sort_values() with .head(), you can answer questions in the form, "What are the top cases where…?".
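
The sorting options above can be sketched on a small, made-up dogs DataFrame (the names and weights are assumptions chosen for illustration). The ascending argument accepts either a single boolean or one boolean per column.

```python
import pandas as pd

# Toy data (hypothetical values)
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "breed": ["Labrador", "Poodle", "Labrador", "Poodle"],
    "weight_kg": [25, 23, 29, 17],
})

# One column, smallest to largest
by_weight = dogs.sort_values("weight_kg")

# One column, largest first
heaviest_first = dogs.sort_values("weight_kg", ascending=False)

# Two columns: breed A to Z, then heaviest first within each breed
by_breed_weight = dogs.sort_values(["breed", "weight_kg"], ascending=[True, False])

# "What are the top cases where...?": the two heaviest dogs
print(heaviest_first.head(2))
```

.sort_values() returns a new DataFrame and leaves dogs unchanged, which is why each result is assigned to its own name.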


* Sort homelessness by the number of homeless individuals in the individuals column, from smallest to largest, and save this as homelessness_ind.
* Print the head of the sorted DataFrame.
* Sort homelessness by the number of homeless family_members in descending order, and save this as homelessness_fam.
* Sort homelessness first by region (ascending), and then by number of family members (descending). Save this as homelessness_reg_fam.

In [10]:
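# Sort homelessness by Count (used here in place of the exercise's individuals column)
# Note: Count is still stored as strings at this point, so the order is lexicographic, not numeric
homelessness_ind = homelessness.sort_values("Count")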

# Print the top few rows
print(homelessness_ind.head())

           Year State CoC Number                    CoC Name  \
86528  1/1/2016    WY     WY-500       Wyoming Statewide CoC   
65135  1/1/2015    NY     NY-608  Kingston/Ulster County CoC   
65137  1/1/2015    NY     NY-608  Kingston/Ulster County CoC   
65140  1/1/2015    NY     NY-608  Kingston/Ulster County CoC   
65144  1/1/2015    NY     NY-608  Kingston/Ulster County CoC   

                                                Measures Count  
86528               Unsheltered Parenting Youth Under 18     0  
65135                 Sheltered Parenting Youth Under 18     0  
65137            Unsheltered Children of Parenting Youth     0  
65140  Unsheltered Chronically Homeless People in Fam...     0  
65144  Unsheltered Homeless Unaccompanied Children (U...     0  
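
One caveat about the sort above: .info() reported Count as object dtype, so its values are compared as text rather than as numbers. A minimal sketch with made-up values shows the difference, along with one possible fix, pd.to_numeric.

```python
import pandas as pd

# Hypothetical counts stored as strings, as they are after reading the CSV
counts = pd.Series(["9", "100", "25", "0"])

# As strings, values are ordered character by character
as_text = counts.sort_values().tolist()

# As numbers, values are ordered by magnitude
as_numbers = pd.to_numeric(counts).sort_values().tolist()

print(as_text)
print(as_numbers)
```

The rows with a Count of "0" come first under either ordering, so the head shown above looks plausible even though the rest of the text-based ordering would not match a numeric sort.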


In [12]:
# Sort homelessness by CoC Number in descending order
# (used here in place of the exercise's family_members column)
homelessness_fam = homelessness.sort_values("CoC Number", ascending=False)

print(homelessness_fam.head())

           Year State CoC Number               CoC Name  \
86528  1/1/2016    WY     WY-500  Wyoming Statewide CoC   
52710  1/1/2014    WY     WY-500  Wyoming Statewide CoC   
52718  1/1/2014    WY     WY-500  Wyoming Statewide CoC   
52717  1/1/2014    WY     WY-500  Wyoming Statewide CoC   
52716  1/1/2014    WY     WY-500  Wyoming Statewide CoC   

                                      Measures Count  
86528     Unsheltered Parenting Youth Under 18     0  
52710              Sheltered Homeless Veterans    95  
52718            Unsheltered Homeless Veterans    21  
52717  Unsheltered Homeless People in Families    58  
52716         Unsheltered Homeless Individuals   136  


In [13]:
# Sort homelessness by Year (ascending), then by CoC Name (descending)
# (used here in place of the exercise's region and family_members columns)
homelessness_reg_fam = homelessness.sort_values(["Year", "CoC Name"], ascending=[True, False])

# Print the top few rows
print(homelessness_reg_fam.head())

         Year State CoC Number                              CoC Name  \
552  1/1/2007    CA     CA-524  Yuba City & County/Sutter County CoC   
553  1/1/2007    CA     CA-524  Yuba City & County/Sutter County CoC   
554  1/1/2007    CA     CA-524  Yuba City & County/Sutter County CoC   
555  1/1/2007    CA     CA-524  Yuba City & County/Sutter County CoC   
556  1/1/2007    CA     CA-524  Yuba City & County/Sutter County CoC   

                                       Measures Count  
552            Chronically Homeless Individuals    44  
553                        Homeless Individuals   212  
554                 Homeless People in Families   150  
555  Sheltered Chronically Homeless Individuals     9  
556                          Sheltered Homeless   299  


# **Subsetting columns**

When working with data, you may not need all of the variables in your dataset. Square brackets ([]) can be used to select only the columns that matter to you in an order that makes sense to you. To select only "col_a" of the DataFrame df, use

df["col_a"]

To select "col_a" and "col_b" of df, use

df[["col_a", "col_b"]]

homelessness is available and pandas is loaded as pd.
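
The difference between single and double square brackets can be sketched as follows. The df below is a made-up three-column DataFrame, and col_c is an extra hypothetical column added to show reordering.

```python
import pandas as pd

# Toy data (hypothetical values)
df = pd.DataFrame({"col_a": [1, 2], "col_b": [3, 4], "col_c": [5, 6]})

one_col = df["col_a"]               # single brackets: a Series
one_col_frame = df[["col_a"]]       # list with one name: a one-column DataFrame
reordered = df[["col_c", "col_a"]]  # columns come back in the order listed

print(type(one_col).__name__)
print(type(one_col_frame).__name__)
print(list(reordered.columns))
```

The outer brackets do the subsetting and the inner brackets build a Python list of column names, which is why a list with a single name still returns a DataFrame.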


* Create a Series called individuals that contains only the individuals column of homelessness.
* Create a DataFrame called state_fam that contains only the state and family_members columns of homelessness, in that order.
* Create a DataFrame called ind_state that contains the individuals and state columns of homelessness, in that order.

In [14]:
# Select the Year column as a Series
# (used here in place of the exercise's individuals column)
individuals = homelessness["Year"]

print(individuals.head())

0    1/1/2007
1    1/1/2007
2    1/1/2007
3    1/1/2007
4    1/1/2007
Name: Year, dtype: object


In [17]:
# Select only the CoC Number and CoC Name columns, in that order
# (used here in place of the exercise's individuals and state columns)
ind_state = homelessness[["CoC Number", "CoC Name"]]

print(ind_state.head())

  CoC Number       CoC Name
0     AK-500  Anchorage CoC
1     AK-500  Anchorage CoC
2     AK-500  Anchorage CoC
3     AK-500  Anchorage CoC
4     AK-500  Anchorage CoC


# **Subsetting rows**

A large part of data science is about finding which bits of your dataset are interesting. One of the simplest techniques for this is to find a subset of rows that match some criteria. This is sometimes known as filtering rows or selecting rows.

There are many ways to subset a DataFrame. Perhaps the most common is to use relational operators to return True or False for each row, then pass the result inside square brackets.

dogs[dogs["height_cm"] > 60]

dogs[dogs["color"] == "tan"]

You can filter for multiple conditions at once by using the "bitwise and" operator, &.

dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]
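
To make the mechanics concrete, the sketch below runs the same filters on a small, made-up dogs DataFrame (all values are assumptions for illustration) and prints the intermediate True/False values.

```python
import pandas as pd

# Toy data (hypothetical values)
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "height_cm": [56, 43, 62, 65],
    "color": ["brown", "black", "tan", "gray"],
})

# A comparison returns one True/False per row
is_tall = dogs["height_cm"] > 60
print(is_tall.tolist())

# Passing the boolean Series inside square brackets keeps the True rows
tall_dogs = dogs[is_tall]

# Each condition needs its own parentheses because & binds more tightly than > and ==
tall_tan_dogs = dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]
print(tall_tan_dogs)
```

Python's and keyword cannot be used in place of & here, because the conditions are whole Series of booleans rather than single True/False values.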

homelessness is available and pandas is loaded as pd.


* Filter homelessness for cases where the number of individuals is greater than ten thousand, assigning to ind_gt_10k. View the printed result.

* Filter homelessness for cases where the USA Census region is "Mountain", assigning to mountain_reg. View the printed result.

* Filter homelessness for cases where the number of family_members is less than one thousand and the region is "Pacific", assigning to fam_lt_1k_pac. View the printed result.

In [26]:
import pandas as pd

# Convert the 'Count' column to numeric; values that cannot be parsed become NaN
homelessness["Count"] = pd.to_numeric(homelessness["Count"], errors='coerce')

# Filter for rows where 'Count' is at least 500
# (the exercise asks for more than 10,000 individuals; this notebook uses a lower threshold on Count)
ind_gt_10k = homelessness[homelessness["Count"] >= 500]

# See the result
print(ind_gt_10k)


           Year State CoC Number                            CoC Name  \
1      1/1/2007    AK     AK-500                       Anchorage CoC   
4      1/1/2007    AK     AK-500                       Anchorage CoC   
5      1/1/2007    AK     AK-500                       Anchorage CoC   
7      1/1/2007    AK     AK-500                       Anchorage CoC   
16     1/1/2007    AK     AK-501         Alaska Balance of State CoC   
...         ...   ...        ...                                 ...   
86449  1/1/2016    WV     WV-508  West Virginia Balance of State CoC   
86462  1/1/2016    WV     WV-508  West Virginia Balance of State CoC   
86472  1/1/2016    WV     WV-508  West Virginia Balance of State CoC   
86491  1/1/2016    WY     WY-500               Wyoming Statewide CoC   
86514  1/1/2016    WY     WY-500               Wyoming Statewide CoC   

                             Measures  Count  
1                Homeless Individuals  696.0  
4                  Sheltered Homeless  84
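
Count now prints as 696.0 rather than 696, most likely because errors='coerce' turned at least one unparseable value into NaN, and a column containing NaN is stored as float. The sketch below uses made-up strings to illustrate the effect; which values in the real file fail to parse is not shown in this notebook.

```python
import pandas as pd

# Hypothetical raw values: two clean numbers and two that cannot be parsed
raw = pd.Series(["696", "842", "n/a", "1,204"])

# Unparseable entries become NaN instead of raising an error
coerced = pd.to_numeric(raw, errors="coerce")
print(coerced)

# Count how many entries were lost in the conversion
n_lost = int(coerced.isna().sum())
print(n_lost)

# Comparisons with NaN are False, so those rows drop out of a filter
kept = coerced[coerced >= 500]
print(kept)
```

Checking .isna().sum() after a coercing conversion is a cheap way to see how much data was silently discarded before filtering on the column.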

In [27]:
# Filter for rows where CoC Name is "Wyoming Statewide CoC"
# (used here in place of the exercise's region == "Mountain" filter)
mountain_reg = homelessness[homelessness['CoC Name'] == "Wyoming Statewide CoC"]

# See the result
print(mountain_reg)

           Year State CoC Number               CoC Name  \
4812   1/1/2007    WY     WY-500  Wyoming Statewide CoC   
4813   1/1/2007    WY     WY-500  Wyoming Statewide CoC   
4814   1/1/2007    WY     WY-500  Wyoming Statewide CoC   
4815   1/1/2007    WY     WY-500  Wyoming Statewide CoC   
4816   1/1/2007    WY     WY-500  Wyoming Statewide CoC   
...         ...   ...        ...                    ...   
86524  1/1/2016    WY     WY-500  Wyoming Statewide CoC   
86525  1/1/2016    WY     WY-500  Wyoming Statewide CoC   
86526  1/1/2016    WY     WY-500  Wyoming Statewide CoC   
86527  1/1/2016    WY     WY-500  Wyoming Statewide CoC   
86528  1/1/2016    WY     WY-500  Wyoming Statewide CoC   

                                                Measures  Count  
4812                    Chronically Homeless Individuals   38.0  
4813                                Homeless Individuals  331.0  
4814                         Homeless People in Families  206.0  
4815          Sheltered Chr

In [28]:
# Filter for rows where CoC Name is "Wyoming Statewide CoC" and Count is at most 200
# (used here in place of the exercise's family_members < 1000 and region == "Pacific" filter)
fam_lt_1k_pac = homelessness[(homelessness['CoC Name'] == 'Wyoming Statewide CoC') & (homelessness['Count'] <= 200)]

# See the result
print(fam_lt_1k_pac)

           Year State CoC Number               CoC Name  \
4812   1/1/2007    WY     WY-500  Wyoming Statewide CoC   
4815   1/1/2007    WY     WY-500  Wyoming Statewide CoC   
4818   1/1/2007    WY     WY-500  Wyoming Statewide CoC   
4820   1/1/2007    WY     WY-500  Wyoming Statewide CoC   
4821   1/1/2007    WY     WY-500  Wyoming Statewide CoC   
...         ...   ...        ...                    ...   
86524  1/1/2016    WY     WY-500  Wyoming Statewide CoC   
86525  1/1/2016    WY     WY-500  Wyoming Statewide CoC   
86526  1/1/2016    WY     WY-500  Wyoming Statewide CoC   
86527  1/1/2016    WY     WY-500  Wyoming Statewide CoC   
86528  1/1/2016    WY     WY-500  Wyoming Statewide CoC   

                                                Measures  Count  
4812                    Chronically Homeless Individuals   38.0  
4815          Sheltered Chronically Homeless Individuals    0.0  
4818               Sheltered Homeless People in Families  140.0  
4820        Unsheltered Chr

Using square brackets plus logical conditions is often the most powerful way of identifying interesting rows of data.

# **Subsetting rows by categorical variables**

Subsetting data based on a categorical variable often involves using the "or" operator (|) to select rows from multiple categories. This can get tedious when you want all states in one of three different regions, for example. Instead, use the .isin() method, which will allow you to tackle this problem by writing one condition instead of three separate ones.

colors = ["brown", "black", "tan"]

condition = dogs["color"].isin(colors)

dogs[condition]
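
For comparison, the sketch below writes the same filter both ways on a made-up dogs DataFrame (the data is hypothetical) and also shows how ~ inverts the condition.

```python
import pandas as pd

# Toy data (hypothetical values)
dogs = pd.DataFrame({
    "name": ["Bella", "Charlie", "Lucy", "Cooper"],
    "color": ["brown", "black", "tan", "gray"],
})

# Chaining == comparisons with | works but grows with every extra category
with_or = dogs[
    (dogs["color"] == "brown") | (dogs["color"] == "black") | (dogs["color"] == "tan")
]

# .isin() expresses the same filter as a single condition
colors = ["brown", "black", "tan"]
with_isin = dogs[dogs["color"].isin(colors)]

# ~ inverts the mask to keep the rows outside the list
others = dogs[~dogs["color"].isin(colors)]

print(with_isin)
print(others)
```

Keeping the categories in a named list also makes it easy to reuse or extend them without touching the filter itself.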


* Filter homelessness for cases where the USA census state is in the list of Mojave states, canu, assigning to mojave_homelessness. View the printed result.

In [29]:
# The Measures categories to keep
# (used here in place of the exercise's list of Mojave Desert states)
measures_homeless = [
    "Chronically Homeless Individuals",
    "Sheltered Chronically Homeless Individuals",
    "Unsheltered Homeless",
    "Unsheltered Homeless Veterans",
]

# Filter for rows whose Measures value is in the list
mojave_homelessness = homelessness[homelessness["Measures"].isin(measures_homeless)]

# See the result
print(mojave_homelessness)

           Year State CoC Number                            CoC Name  \
0      1/1/2007    AK     AK-500                       Anchorage CoC   
3      1/1/2007    AK     AK-500                       Anchorage CoC   
9      1/1/2007    AK     AK-500                       Anchorage CoC   
12     1/1/2007    AK     AK-501         Alaska Balance of State CoC   
15     1/1/2007    AK     AK-501         Alaska Balance of State CoC   
...         ...   ...        ...                                 ...   
86483  1/1/2016    WV     WV-508  West Virginia Balance of State CoC   
86489  1/1/2016    WY     WY-500               Wyoming Statewide CoC   
86502  1/1/2016    WY     WY-500               Wyoming Statewide CoC   
86519  1/1/2016    WY     WY-500               Wyoming Statewide CoC   
86525  1/1/2016    WY     WY-500               Wyoming Statewide CoC   

                                         Measures  Count  
0                Chronically Homeless Individuals  224.0  
3      Sheltered 