# One Solution to In-Class Exercises

## Working with Pandas -- an in-class working session

OK, let's create a DataFrame from a dictionary, following the example on pg 116 of Python for Data Analysis (PDA).

In [1]:
import pandas as pd
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
df = pd.DataFrame(data)

Explain the contents and structure of 'data'.

What does 'DataFrame(data)' do? What if we did not begin that line with 'df ='?
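
One way to explore these two questions is to inspect the objects directly. The sketch below repeats the setup so that it runs on its own; the particular checks (`type`, `keys`, `shape`, `index`) are just one choice of things to look at. Without the leading `df =`, the DataFrame would still be built (and shown, if it is the last expression in a notebook cell), but no name would refer to it afterwards.

```python
import pandas as pd

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

# 'data' is a plain dict: each key is a column name,
# and each value is a list holding that column's entries.
print(type(data))
print(list(data.keys()))
print([len(values) for values in data.values()])

# pd.DataFrame(data) builds a table with one column per key
# and a default integer row index starting at 0.
df = pd.DataFrame(data)
print(df.shape)
print(list(df.index))
```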

Look at the contents of df, using just df by itself, and 'print(df)'.

In [None]:
df

We can refer to a column in two ways:

In [None]:
df['state']

In [None]:
df.state

We can step through the rows in a DataFrame:

In [None]:
for label in df.state.index:
    print(df.state[label])

And find the position of a specific substring within each entry (find returns -1 when the substring is absent):

In [None]:
for label in df.state.index:
    print(df.state[label].find('io'))

We can also test each entry and print a placeholder when it does not match:

In [None]:
for label in df.state.index:
    if df.state[label]=='Ohio':
        print(df.state[label])
    else:
        print('Missing')
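
The loops above can also be written without explicit iteration. The following is a sketch of an alternative, not the approach the exercise asks for: it uses the `.str.find` string accessor and `Series.where`, and the names `positions` and `labels` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                   'year': [2000, 2001, 2002, 2001, 2002],
                   'pop': [1.5, 1.7, 3.6, 2.4, 2.9]})

# Element-wise substring search: the same values the loop gets
# by calling .find('io') on each string in turn.
positions = df['state'].str.find('io')

# Keep entries equal to 'Ohio' and replace all others with 'Missing'.
labels = df['state'].where(df['state'] == 'Ohio', 'Missing')

print(positions.tolist())
print(labels.tolist())
```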

Below is a series of questions, with the answers left for you to fill in using pandas expressions that draw on the methods in Chapter 5. You should not need anything beyond the content of that chapter (a subset of the methods summarized above) to do this exercise. If you have been keeping up with the reading, you should be able to complete it in class.

How can we get a quick statistical profile of all the numeric columns?

In [None]:
df.describe()

Can you get a profile of a column that is not numeric, like state? Try it.

In [None]:
df['state'].describe()

How can we print the data types of each column?

In [None]:
df.dtypes

How can we print just the column containing state names?

In [None]:
df['state']

How can we get a list of the states in the DataFrame, without duplicates?

In [None]:
df['state'].unique()

How can we get a count of how many rows we have in each state?

In [None]:
df['state'].value_counts()

How can we compute the mean of population across all the rows?

In [None]:
df['pop'].mean()

How can we compute the maximum population across all the rows?

In [None]:
df['pop'].max()

How can we compute the 20th percentile value of population? 

In [None]:
df['pop'].quantile(.2)

How can we compute a Boolean array indicating whether the state is 'Ohio'?

In [None]:
df['state']=='Ohio'

How can we select and print just the rows for Ohio?

In [None]:
df[df['state']=='Ohio']

How can we create a new DataFrame containing only the Ohio records?

In [None]:
dfnew = df[df['state']=='Ohio']
dfnew

How can we select and print just the rows in which population is more than 2?

In [None]:
df[df['pop']>2]

How could we compute the mean population for Ohio, averaging across years?

In [None]:
df[df['state']=='Ohio']['pop'].mean()
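
A possible variant, offered only as a sketch: the same mean can be computed with a single `.loc` call that picks the rows and the column together, instead of chaining two indexing steps. The names `chained` and `single_step` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                   'year': [2000, 2001, 2002, 2001, 2002],
                   'pop': [1.5, 1.7, 3.6, 2.4, 2.9]})

# Chained form: filter the rows first, then pick the column.
chained = df[df['state'] == 'Ohio']['pop'].mean()

# Single-step form: the row mask and the column label go into one .loc call.
single_step = df.loc[df['state'] == 'Ohio', 'pop'].mean()

print(chained, single_step)
```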

How can we print the DataFrame, sorted by state and, within each state, by population?

In [None]:
df.sort_values(by=['state','pop'])

How can we print the row for Ohio, 2002, selecting on its values (not on row and column indexes)?

This solution uses an & operator to set two conditions that must both be met:

In [14]:
df[(df['state']=='Ohio') & (df['year']==2002)]

   pop state  year
2  3.6  Ohio  2002


Breaking the preceding solution down into smaller steps makes it easier to see how this works:

In [12]:
is_ohio = df['state']=='Ohio'
is_2002 = df['year']==2002
df[is_ohio & is_2002]

   pop state  year
2  3.6  Ohio  2002
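
To see why the solution uses `&` rather than Python's `and`, the sketch below combines the two masks both ways. `&` works row by row, while `and` asks for a single truth value for a whole Series, which pandas rejects with a `ValueError`. The parentheses in the one-line version matter because `&` binds more tightly than `==`. The names `both` and `raised` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                   'year': [2000, 2001, 2002, 2001, 2002],
                   'pop': [1.5, 1.7, 3.6, 2.4, 2.9]})

is_ohio = df['state'] == 'Ohio'
is_2002 = df['year'] == 2002

# & combines the two Boolean Series element by element.
both = is_ohio & is_2002
print(both.tolist())

# `and` needs one True/False for the whole Series, so pandas raises ValueError.
try:
    is_ohio and is_2002
    raised = False
except ValueError:
    raised = True
print(raised)
```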


How can we use row and column indexing to set the population of Ohio in 2002 to 3.4?

In [None]:
df.loc[2, 'pop'] = 3.4
df

How can we use row and column indexing to append a new record for Utah, initially with no population or year? 

In [None]:
df.loc[5, 'state'] = 'Utah'
df
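
Assigning to a row label that does not exist yet enlarges the DataFrame, and the cells that were not assigned are filled with NaN. The sketch below rebuilds the frame so that it runs on its own and checks those missing values; printing `dtypes` before and after shows whether the integer `year` column was converted, since NaN cannot be stored in an int64 column.

```python
import pandas as pd

df = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                   'year': [2000, 2001, 2002, 2001, 2002],
                   'pop': [1.5, 1.7, 3.6, 2.4, 2.9]})

print(df.dtypes)

# Row label 5 does not exist yet, so this assignment adds a new row.
df.loc[5, 'state'] = 'Utah'

# The unassigned cells of the new row are missing values.
print(df.loc[5].isna())

# Compare with the dtypes printed before the new row was added.
print(df.dtypes)
```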

How can we set the population to 2.5 and year to 2001 for the new record?

In [None]:
df.loc[5, 'pop'] = 2.5
df.loc[5, 'year'] = 2001
df