<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Import-pandas-and-load-the-NLS-data" data-toc-modified-id="Import-pandas-and-load-the-NLS-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Import pandas and load the NLS data</a></span></li><li><span><a href="#Edit-all-the-values-based-on-a-scalar" data-toc-modified-id="Edit-all-the-values-based-on-a-scalar-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Edit all the values based on a scalar</a></span></li><li><span><a href="#Set-values-using-index-labels" data-toc-modified-id="Set-values-using-index-labels-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Set values using index labels</a></span></li><li><span><a href="#Set-values-using-an-operator-on-more-than-one-series" data-toc-modified-id="Set-values-using-an-operator-on-more-than-one-series-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Set values using an operator on more than one series</a></span></li><li><span><a href="#Set-the-values-for-a-summary-statistic-using-index-labels" data-toc-modified-id="Set-the-values-for-a-summary-statistic-using-index-labels-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Set the values for a summary statistic using index labels</a></span></li><li><span><a href="#Set-the-values-using-position" data-toc-modified-id="Set-the-values-using-position-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Set the values using position</a></span></li><li><span><a href="#Set-the-GPA-values-after-filtering" data-toc-modified-id="Set-the-GPA-values-after-filtering-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Set the GPA values after filtering</a></span></li></ul></div>

# Import pandas and load the NLS data

In [1]:
import pandas as pd

In [3]:
# pd.set_option('display.width', 200)
# pd.set_option('display.max_columns', 35)
# pd.set_option('display.max_rows', 200)
pd.options.display.float_format = '{:,.2f}'.format

In [4]:
import watermark
%load_ext watermark

%watermark -n -i -iv

watermark: 2.1.0
json     : 2.0.9
pandas   : 1.2.1



In [5]:
nls97 = pd.read_csv('data/nls97b.csv')
nls97.set_index('personid', inplace=True)
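
The two-step load above can also be written as a single call. The sketch below uses a small in-memory CSV (hypothetical values, not the NLS file) to suggest that passing `index_col` to `read_csv` gives the same frame as calling `set_index` afterwards.

```python
import io

import pandas as pd

# Hypothetical two-row CSV standing in for data/nls97b.csv.
csv_text = "personid,gpaoverall\n100061,3.06\n100139,\n"

# Two-step form, as in the cell above.
two_step = pd.read_csv(io.StringIO(csv_text))
two_step.set_index("personid", inplace=True)

# One-step form: read_csv sets the index while parsing.
one_step = pd.read_csv(io.StringIO(csv_text), index_col="personid")

print(two_step.equals(one_step))
```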

# Edit all the values based on a scalar

In [7]:
nls97['gpaoverall'].head()

personid
100061   3.06
100139    NaN
100284    NaN
100292   3.45
100583   2.91
Name: gpaoverall, dtype: float64

In [9]:
gpaoverall100 = nls97['gpaoverall'] * 100
gpaoverall100.head()

personid
100061   306.00
100139      NaN
100284      NaN
100292   345.00
100583   291.00
Name: gpaoverall, dtype: float64
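
As the output above shows, a scalar operation is applied to every element and missing values stay missing. The following is a minimal sketch on a hypothetical three-row Series (the values only mimic the rows shown above). It also checks that the original Series is left unchanged.

```python
import numpy as np
import pandas as pd

# Hypothetical values that mimic the first rows of gpaoverall.
gpa = pd.Series(
    [3.06, np.nan, 2.91],
    index=pd.Index([100061, 100139, 100583], name="personid"),
    name="gpaoverall",
)

# The scalar is broadcast to every element; NaN stays NaN.
scaled = gpa * 100
print(scaled)

# The multiplication returns a new Series; gpa itself is not modified.
print(gpa.loc[100061])
```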

# Set values using index labels

In [11]:
# Assigning through a chained selection, such as nls97.gpaoverall.loc[[100061]] = 3, triggers
# the pandas warning about setting values on a copy of a DataFrame.
# A single .loc call with both the row and column selectors, as below, does not.

nls97.loc[[100061], 'gpaoverall'] = 3
nls97.loc[[100139, 100284, 100292], 'gpaoverall'] = 0
nls97['gpaoverall'].head()

personid
100061   3.00
100139   0.00
100284   0.00
100292   0.00
100583   2.91
Name: gpaoverall, dtype: float64
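
The pattern above can be tried on a small frame. This sketch assumes a hypothetical five-row DataFrame with the same index labels. It runs only the single `.loc` form and leaves the chained form as a comment for contrast.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same personid labels as the rows above.
df = pd.DataFrame(
    {"gpaoverall": [3.06, np.nan, np.nan, 3.45, 2.91]},
    index=pd.Index([100061, 100139, 100284, 100292, 100583], name="personid"),
)

# One .loc call with a row selector and a column selector writes into df itself.
df.loc[[100061], "gpaoverall"] = 3
df.loc[[100139, 100284, 100292], "gpaoverall"] = 0

# Chained form, shown for contrast only and not executed here:
# df.gpaoverall.loc[[100061]] = 3

print(df["gpaoverall"])
```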

# Set values using an operator on more than one series

In [14]:
nls97['childnum'] = nls97['childathome'] + nls97['childnotathome']
nls97['childnum'].value_counts().sort_index()

0.00       23
1.00     1364
2.00     1729
3.00     1020
4.00      420
5.00      149
6.00       55
7.00       21
8.00        7
9.00        1
12.00       2
Name: childnum, dtype: int64
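
The counts above sum to 4,791, which matches the non-null count of both source columns, because `value_counts` drops missing values by default and `+` returns `NaN` whenever either operand is missing. The sketch below uses two hypothetical Series to show that behavior next to `Series.add` with `fill_value=0`, one possible alternative when a missing operand should count as zero.

```python
import numpy as np
import pandas as pd

# Hypothetical child counts; NaN marks a missing response.
at_home = pd.Series([1.0, np.nan, 2.0, np.nan])
not_at_home = pd.Series([0.0, 1.0, np.nan, np.nan])

# Plain addition: the result is NaN if either side is NaN.
plain = at_home + not_at_home

# add() with fill_value=0 treats a single missing side as 0.
# The result is still NaN where both sides are missing.
filled = at_home.add(not_at_home, fill_value=0)

print(pd.DataFrame({"plain": plain, "filled": filled}))
```

Whether treating a missing value as zero is appropriate depends on what the missing value means in the survey, so the plain `+` used above may well be the safer choice.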

# Set the values for a summary statistic using index labels

In [16]:
nls97.loc[100061:100292, 'gpaoverall'] = nls97['gpaoverall'].mean()
nls97['gpaoverall'].head()

personid
100061   2.82
100139   2.82
100284   2.82
100292   2.82
100583   2.91
Name: gpaoverall, dtype: float64
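
Two details explain the output above: a label slice in `loc` includes both endpoints, and `mean()` skips missing values by default. The following sketch on a hypothetical five-row frame illustrates both. The mean is stored in a variable first to make it clear that it is computed before any value is overwritten.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; the index is sorted, so label slicing is well defined.
df = pd.DataFrame(
    {"gpaoverall": [3.06, np.nan, np.nan, 3.45, 2.91]},
    index=pd.Index([100061, 100139, 100284, 100292, 100583], name="personid"),
)

# mean() ignores NaN, so only the three non-missing values are averaged.
mean_gpa = df["gpaoverall"].mean()

# The slice 100061:100292 includes the row labeled 100292.
df.loc[100061:100292, "gpaoverall"] = mean_gpa

print(df["gpaoverall"])
```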

# Set the values using position

In [21]:
nls97.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8984 entries, 100061 to 999963
Data columns (total 89 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   gender                 8984 non-null   object 
 1   birthmonth             8984 non-null   int64  
 2   birthyear              8984 non-null   int64  
 3   highestgradecompleted  6663 non-null   float64
 4   maritalstatus          6672 non-null   object 
 5   childathome            4791 non-null   float64
 6   childnotathome         4791 non-null   float64
 7   wageincome             5091 non-null   float64
 8   weeklyhrscomputer      6710 non-null   object 
 9   weeklyhrstv            6711 non-null   object 
 10  nightlyhrssleep        6706 non-null   float64
 11  satverbal              1406 non-null   float64
 12  satmath                1407 non-null   float64
 13  gpaoverall             6006 non-null   float64
 14  gpaenglish             5798 non-null   float64
 ...

In [17]:
nls97.iloc[0, 13] = 2
nls97.iloc[1:4, 13] = 1
nls97['gpaoverall'].head()

personid
100061   2.00
100139   1.00
100284   1.00
100292   1.00
100583   2.91
Name: gpaoverall, dtype: float64
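
The hard-coded position 13 comes from the `info()` listing and would silently point at a different column if the column order changed. As a sketch, assuming a hypothetical two-column frame, `columns.get_loc` can look up the position by name instead. Unlike a `loc` label slice, the `iloc` slice `1:4` excludes its end position.

```python
import numpy as np
import pandas as pd

# Hypothetical frame; gpaoverall sits at position 1 here, not 13.
df = pd.DataFrame(
    {
        "gender": ["Female", "Male", "Male", "Female", "Male"],
        "gpaoverall": [3.06, np.nan, np.nan, 3.45, 2.91],
    },
    index=pd.Index([100061, 100139, 100284, 100292, 100583], name="personid"),
)

# Look up the integer position of the column by its name.
gpa_pos = df.columns.get_loc("gpaoverall")

df.iloc[0, gpa_pos] = 2      # first row only
df.iloc[1:4, gpa_pos] = 1    # rows at positions 1, 2, and 3

print(df["gpaoverall"])
```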

# Set the GPA values after filtering

In [23]:
nls97['gpaoverall'].nlargest()

personid
312410   4.17
639701   4.11
850001   4.10
279096   4.08
620216   4.07
Name: gpaoverall, dtype: float64

In [24]:
nls97.loc[nls97['gpaoverall'] > 4, 'gpaoverall'] = 4
nls97['gpaoverall'].nlargest()

personid
112756   4.00
119784   4.00
160193   4.00
250666   4.00
271961   4.00
Name: gpaoverall, dtype: float64
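
After capping, many rows tie at 4.00, and `nlargest` keeps the first occurrences by default, which is why the person IDs differ from the previous output. The sketch below, on hypothetical values, compares the boolean-mask assignment used above with `Series.clip(upper=4)`, which is one alternative way to express the same cap. Note that `clip` returns a new Series rather than modifying the data in place.

```python
import numpy as np
import pandas as pd

# Hypothetical GPA values, two of them above 4.
gpa = pd.Series([4.17, 3.50, np.nan, 4.00, 4.08, 2.00], name="gpaoverall")

# Boolean-mask assignment, as in the cell above, applied to a copy.
capped_mask = gpa.copy()
capped_mask.loc[capped_mask > 4] = 4

# clip() caps values at the upper bound and leaves NaN untouched.
capped_clip = gpa.clip(upper=4)

print(capped_mask.equals(capped_clip))
```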

In [26]:
# nls97.loc[[100061], 'gpaoverall'] returns a Series,
# while nls97.loc[[100061], ['gpaoverall']] returns a DataFrame:

type(nls97.loc[[100061], 'gpaoverall'])

pandas.core.series.Series

In [27]:
# With a list as the row selector, a string as the column selector returns a Series.
# A list as the column selector, even one with a single item, returns a DataFrame.

type(nls97.loc[[100061], ['gpaoverall']])

pandas.core.frame.DataFrame
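
The return type also depends on the row selector. This sketch, on a hypothetical two-row frame, adds the scalar case (a single label for both the row and the column) to the two cases shown above.

```python
import pandas as pd

# Hypothetical frame with two of the personid labels used above.
df = pd.DataFrame(
    {"gpaoverall": [3.06, 2.91]},
    index=pd.Index([100061, 100583], name="personid"),
)

scalar_value = df.loc[100061, "gpaoverall"]      # label, label -> single value
as_series = df.loc[[100061], "gpaoverall"]       # list, label  -> Series
as_frame = df.loc[[100061], ["gpaoverall"]]      # list, list   -> DataFrame

print(type(scalar_value).__name__)
print(type(as_series).__name__)
print(type(as_frame).__name__)
```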