# 2. Refine the Data
 
> "Data is messy"

- **Missing** e.g. Check for missing or incomplete data
- **Quality** e.g. Check for duplicates, accuracy, unusual data
- **Parse** e.g. extract year and month from date
- **Convert** e.g. free text to coded value
- **Derive** e.g. gender from title
- **Calculate** e.g. percentages, proportion
- **Remove** e.g. remove redundant data
- **Merge** e.g. first and surname for full name
- **Aggregate** e.g. rollup by year, cluster by area
- **Filter** e.g. exclude based on location
- **Sample** e.g. extract a representative sample
- **Summary** e.g. show summary stats like mean

In [1]:
# Load the libraries
import numpy as np
import pandas as pd

In [3]:
# Load the data again!
df = pd.read_csv("https://raw.githubusercontent.com/reddyprasade/Data-Analysis-with-Python/main/Statistics/Data/Weed_Price.csv", parse_dates=[-1])

In [4]:
df

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date
0,Alabama,339.06,1042,198.64,933,149.49,123,2014-01-01
1,Alaska,288.75,252,260.60,297,388.58,26,2014-01-01
2,Arizona,303.31,1941,209.35,1625,189.45,222,2014-01-01
3,Arkansas,361.85,576,185.62,544,125.87,112,2014-01-01
4,California,248.78,12096,193.56,12812,192.92,778,2014-01-01
...,...,...,...,...,...,...,...,...
22894,Virginia,364.98,3513,293.12,3079,,284,2014-12-31
22895,Washington,233.05,3337,189.92,3562,,160,2014-12-31
22896,West Virginia,359.35,551,224.03,545,,60,2014-12-31
22897,Wisconsin,350.52,2244,272.71,2221,,167,2014-12-31


## 2.1 Missing Data

By “missing” data we simply mean null or “not present for whatever reason”. Let's see if we can find missing data in our data set, whether it existed but was not collected or never existed at all.

In [5]:
# Count the non-null values in each column to spot missing data
df.count()

State     22899
HighQ     22899
HighQN    22899
MedQ      22899
MedQN     22899
LowQ      12342
LowQN     22899
date      22899
dtype: int64

In [7]:
df.isnull().sum()

State         0
HighQ         0
HighQN        0
MedQ          0
MedQN         0
LowQ      10557
LowQN         0
date          0
dtype: int64

In [6]:
df['LowQ'].isnull().sum()

10557

In [8]:
# We can see the bottom rows which have NaN values
df.tail()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date
22894,Virginia,364.98,3513,293.12,3079,,284,2014-12-31
22895,Washington,233.05,3337,189.92,3562,,160,2014-12-31
22896,West Virginia,359.35,551,224.03,545,,60,2014-12-31
22897,Wisconsin,350.52,2244,272.71,2221,,167,2014-12-31
22898,Wyoming,322.27,131,351.86,197,,12,2014-12-31


**Pandas represents missing values as NaN**

What can we do with missing values?
- Drop these rows / columns? Use `.dropna(how='any')`
- Fill with a dummy value? Use `.fillna(value=dummy)`
- Impute the cell with the most recent value? Use `.ffill()` (older code uses `.fillna(method='ffill')`, which recent pandas versions deprecate)
- Interpolate the amount in a linear fashion? Use `.interpolate()`

Passing `inplace=True` modifies the dataframe itself instead of returning a modified copy
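
The four options above can be compared side by side on a tiny series. This is only a sketch: the values are made up for illustration and are not taken from the price data.

```python
import numpy as np
import pandas as pd

# Toy series with two gaps (hypothetical values, not from the price data)
s = pd.Series([1.0, np.nan, np.nan, 4.0])

dropped = s.dropna()       # drop the missing entries
filled = s.fillna(0.0)     # replace them with a dummy value
forward = s.ffill()        # carry the last seen value forward
linear = s.interpolate()   # draw a straight line between known points

print(len(dropped))        # 2
print(filled.tolist())     # [1.0, 0.0, 0.0, 4.0]
print(forward.tolist())    # [1.0, 1.0, 1.0, 4.0]
print(linear.tolist())     # [1.0, 2.0, 3.0, 4.0]
```

Which option is appropriate depends on why the values are missing; none of them recovers the true value.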

In [13]:
# Let's sort this data frame by State and date
df.sort_values(by=['State','date'], inplace=True)

In [14]:
# Let's fill the missing values with the last available value
# (.ffill() replaces fillna(method="ffill"), which recent pandas versions deprecate)
df.ffill(inplace=True)

In [15]:
df.count()

State     22899
HighQ     22899
HighQN    22899
MedQ      22899
MedQN     22899
LowQ      22899
LowQN     22899
date      22899
dtype: int64

In [16]:
df.tail()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date
4997,Wyoming,313.72,148,317.38,226,161.3,13,2015-06-07
5762,Wyoming,313.72,148,317.38,226,161.3,13,2015-06-08
6527,Wyoming,313.72,148,317.38,226,161.3,13,2015-06-09
7343,Wyoming,313.72,148,317.38,226,161.3,13,2015-06-10
8159,Wyoming,313.72,148,317.38,226,161.3,13,2015-06-11
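
One caveat worth keeping in mind: a forward fill on a frame sorted by state and date can carry a value from the last row of one state into the first rows of the next. The sketch below uses a hypothetical two-state frame to show the difference between a plain forward fill and one restricted to each group. Whether this matters for the price data depends on where its gaps fall.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: state B starts with a missing value
toy = pd.DataFrame({
    "State": ["A", "A", "B", "B"],
    "LowQ": [10.0, np.nan, np.nan, 20.0],
})

plain = toy["LowQ"].ffill()                      # fills across the A/B boundary
grouped = toy.groupby("State")["LowQ"].ffill()   # fills within each state only

print(plain.tolist())           # [10.0, 10.0, 10.0, 20.0]
print(grouped.isna().tolist())  # [False, False, True, False]
```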


### Exercise

Fill the missing value with a backward fill.

Fill the missing values with the mean for the column.

## 2.2 Quality of the Data 

Let's check for completeness.

**Say, do we have data on each date for all 51 states (the 50 states plus the District of Columbia)?**

In [17]:
df["State"].value_counts()

Texas                   449
New Mexico              449
Kansas                  449
South Dakota            449
Georgia                 449
North Carolina          449
Illinois                449
West Virginia           449
New York                449
Tennessee               449
South Carolina          449
North Dakota            449
Maryland                449
Nebraska                449
Washington              449
Oklahoma                449
Pennsylvania            449
Arkansas                449
New Jersey              449
Rhode Island            449
Vermont                 449
Arizona                 449
Wisconsin               449
Minnesota               449
Maine                   449
Kentucky                449
Wyoming                 449
Delaware                449
Iowa                    449
Virginia                449
Louisiana               449
Connecticut             449
Nevada                  449
Ohio                    449
California              449
New Hampshire       

**Let's check the dates and see if they are all continuous**

In [18]:
df1 = df[df.State=='California'].copy()
df2 = df[df.State=='California'].copy()

In [19]:
df1.shape

(449, 8)

In [20]:
print("Earliest Date:",df1.date.min(), "\n", "Latest Date:",df1.date.max(), "\n",  \
"Number of days in between them:", df1.date.max() - df1.date.min() )

Earliest Date: 2013-12-27 00:00:00 
 Latest Date: 2015-06-11 00:00:00 
 Number of days in between them: 531 days 00:00:00


In [21]:
df1.groupby(['date']).size()

date
2013-12-27    1
2013-12-28    1
2013-12-29    1
2013-12-30    1
2013-12-31    1
             ..
2015-06-07    1
2015-06-08    1
2015-06-09    1
2015-06-10    1
2015-06-11    1
Length: 449, dtype: int64

In [22]:
df1.set_index("date", inplace=True)
df2.set_index("date", inplace=True)

In [23]:
idx = pd.date_range(df1.index.min(), df1.index.max())

In [24]:
idx

DatetimeIndex(['2013-12-27', '2013-12-28', '2013-12-29', '2013-12-30',
               '2013-12-31', '2014-01-01', '2014-01-02', '2014-01-03',
               '2014-01-04', '2014-01-05',
               ...
               '2015-06-02', '2015-06-03', '2015-06-04', '2015-06-05',
               '2015-06-06', '2015-06-07', '2015-06-08', '2015-06-09',
               '2015-06-10', '2015-06-11'],
              dtype='datetime64[ns]', length=532, freq='D')

In [25]:
df1 = df1.reindex(idx, fill_value=0)

In [26]:
df1.shape

(532, 7)
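
To see what `reindex` does here, consider a minimal sketch on a hypothetical three-row frame with one missing day. It leaves out `fill_value`, so the added row holds NaN. With `fill_value=0`, as in the cell above, every column of an added row is set to 0, including text columns such as `State`.

```python
import pandas as pd

# Hypothetical daily data with 2014-01-03 missing
toy = pd.DataFrame(
    {"price": [1.0, 2.0, 4.0]},
    index=pd.to_datetime(["2014-01-01", "2014-01-02", "2014-01-04"]),
)

# date_range defaults to daily frequency
full = pd.date_range(toy.index.min(), toy.index.max())

# Rows for dates absent from toy are created and left as NaN
reindexed = toy.reindex(full)

print(reindexed.shape)                   # (4, 1)
print(reindexed["price"].isna().sum())   # 1
```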

**Exercise** Show the list of dates that were missing. *Hint* Leverage df2. Both df1 and df2 have `date` as index

## 2.3  Parse the Data

Let's see if we can get the year, month, week and weekday from the date. Pandas has good built-in support for time series data: date components can be read off a `DatetimeIndex` (or through the `.dt` accessor on a datetime column).

In [27]:
df['year'] = pd.DatetimeIndex(df['date']).year
df['month'] = pd.DatetimeIndex(df['date']).month
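# DatetimeIndex.week was removed in pandas 2.0; isocalendar() gives the ISO week number
df['week'] = df['date'].dt.isocalendar().week.astype('int64')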
df['weekday'] = pd.DatetimeIndex(df['date']).weekday



In [28]:
df.dtypes

State              object
HighQ             float64
HighQN              int64
MedQ              float64
MedQN               int64
LowQ              float64
LowQN               int64
date       datetime64[ns]
year                int64
month               int64
week                int64
weekday             int64
dtype: object

In [21]:
df.tail()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date,year,month,week,weekday
4997,Wyoming,313.72,148,317.38,226,161.3,13,2015-06-07,2015,6,23,6
5762,Wyoming,313.72,148,317.38,226,161.3,13,2015-06-08,2015,6,24,0
6527,Wyoming,313.72,148,317.38,226,161.3,13,2015-06-09,2015,6,24,1
7343,Wyoming,313.72,148,317.38,226,161.3,13,2015-06-10,2015,6,24,2
8159,Wyoming,313.72,148,317.38,226,161.3,13,2015-06-11,2015,6,24,3


In [29]:
df['year'].value_counts()

2014    18564
2015     4080
2013      255
Name: year, dtype: int64

In [23]:
df['month'].value_counts()

1     3162
5     2703
2     2244
6     2091
12    1836
10    1581
7     1581
3     1581
11    1530
9     1530
8     1530
4     1530
dtype: int64

In [30]:
df["weekday"].value_counts()

0    3315
6    3315
1    3264
2    3264
3    3264
4    3264
5    3213
Name: weekday, dtype: int64

In [31]:
# Write code to find the weekend days
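
One possible approach, sketched on a hypothetical four-day frame rather than the full data set: pandas numbers weekdays from Monday = 0 to Sunday = 6, so Saturday and Sunday are 5 and 6. The column name `is_weekend` is an assumption made for this example.

```python
import pandas as pd

# Friday through Monday (hypothetical sample of dates)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2015-06-05", "2015-06-06", "2015-06-07", "2015-06-08"])
})

toy["weekday"] = toy["date"].dt.weekday           # Monday=0 ... Sunday=6
toy["is_weekend"] = toy["weekday"].isin([5, 6])   # Saturday or Sunday

print(toy["weekday"].tolist())      # [4, 5, 6, 0]
print(toy["is_weekend"].tolist())   # [False, True, True, False]
```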

## 2.4 Aggregate the Data

To aggregate, we typically use the “group by” function, which involves the following steps

- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
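
The three steps can be spelled out by hand on a small frame and compared with the one-line `groupby` call. This is a sketch with made-up prices for two hypothetical states.

```python
import pandas as pd

# Hypothetical prices for two states
toy = pd.DataFrame({
    "State": ["A", "A", "B"],
    "HighQ": [300.0, 310.0, 250.0],
})

# Split into groups, apply mean() to each, combine into a Series
pieces = {}
for state, group in toy.groupby("State"):
    pieces[state] = group["HighQ"].mean()
manual = pd.Series(pieces)

# The same result in one step
auto = toy.groupby("State")["HighQ"].mean()

print(manual.to_dict())   # {'A': 305.0, 'B': 250.0}
print(auto.to_dict())     # {'A': 305.0, 'B': 250.0}
```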

In [32]:
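# numeric_only=True restricts the mean to numeric columns, matching the output below
# (newer pandas versions no longer drop the other columns silently)
df_mean = df.groupby("State", as_index=False).mean(numeric_only=True)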

In [33]:
df_mean

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,year,month,week,weekday
0,Alabama,339.561849,1379.414254,204.606169,1270.351893,145.978508,161.14922,2014.167038,5.953229,23.812918,2.995546
1,Alaska,291.482004,321.244989,262.046392,407.917595,394.653964,32.334076,2014.167038,5.953229,23.812918,2.995546
2,Arizona,300.667483,2392.465479,209.365345,2137.414254,188.500134,279.006682,2014.167038,5.953229,23.812918,2.995546
3,Arkansas,348.056147,751.988864,190.414655,724.683742,126.771269,135.902004,2014.167038,5.953229,23.812918,2.995546
4,California,245.376125,14947.073497,191.268909,16769.821826,189.783586,976.298441,2014.167038,5.953229,23.812918,2.995546
5,Colorado,238.918708,2816.218263,196.532517,2457.512249,226.781114,165.349666,2014.167038,5.953229,23.812918,2.995546
6,Connecticut,341.694076,1625.120267,271.323898,1777.227171,251.625724,110.229399,2014.167038,5.953229,23.812918,2.995546
7,Delaware,366.781849,440.971047,231.230312,372.587973,204.960245,39.175947,2014.167038,5.953229,23.812918,2.995546
8,District of Columbia,348.177416,575.091314,288.251314,494.650334,210.225367,46.583519,2014.167038,5.953229,23.812918,2.995546
9,Florida,302.570312,8415.03118,217.882561,7127.216036,152.285457,632.077951,2014.167038,5.953229,23.812918,2.995546


In [26]:
df_mean.shape

(51, 11)

In [34]:
df_mean.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,year,month,week,weekday
0,Alabama,339.561849,1379.414254,204.606169,1270.351893,145.978508,161.14922,2014.167038,5.953229,23.812918,2.995546
1,Alaska,291.482004,321.244989,262.046392,407.917595,394.653964,32.334076,2014.167038,5.953229,23.812918,2.995546
2,Arizona,300.667483,2392.465479,209.365345,2137.414254,188.500134,279.006682,2014.167038,5.953229,23.812918,2.995546
3,Arkansas,348.056147,751.988864,190.414655,724.683742,126.771269,135.902004,2014.167038,5.953229,23.812918,2.995546
4,California,245.376125,14947.073497,191.268909,16769.821826,189.783586,976.298441,2014.167038,5.953229,23.812918,2.995546


### Pivot Table

In [28]:
df.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date,year,month,week,weekday
20094,Alabama,339.65,1033,198.04,926,147.15,122,2013-12-27,2013,12,52,4
20859,Alabama,339.65,1033,198.04,926,147.15,122,2013-12-28,2013,12,52,5
21573,Alabama,339.75,1036,198.26,929,149.49,123,2013-12-29,2013,12,52,6
22287,Alabama,339.75,1036,198.81,930,149.49,123,2013-12-30,2013,12,1,0
22797,Alabama,339.42,1040,198.68,932,149.49,123,2013-12-31,2013,12,1,1


In [29]:
pd.pivot_table(df, values='HighQ', index=['State'], columns=["weekday"] )

weekday,0,1,2,3,4,5,6
State,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Alabama,339.556308,339.577656,339.559375,339.5525,339.577969,339.568254,339.541385
Alaska,291.463231,291.384687,291.39,291.5975,291.506406,291.596032,291.438923
Arizona,300.612,300.653906,300.64375,300.655,300.751562,300.715873,300.642308
Arkansas,347.930462,348.01125,348.017656,347.934844,348.278281,348.235238,347.991077
California,245.348923,245.364375,245.359219,245.342813,245.434219,245.425556,245.359231
Colorado,238.948462,238.885938,238.889844,238.977187,238.885937,238.897619,238.944769
Connecticut,341.624615,341.682812,341.63375,341.616406,341.832656,341.819683,341.652308
Delaware,366.712769,366.74875,366.705937,366.659375,366.962344,366.937143,366.750615
District of Columbia,348.205385,348.055,348.132656,348.145625,348.213594,348.277778,348.212462
Florida,302.548923,302.577031,302.529688,302.510781,302.638437,302.648095,302.541231
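
A small sketch of how `pd.pivot_table` arrives at numbers like these: by default it averages all rows that share the same index and column values, and the `aggfunc` parameter selects a different summary. The frame below is hypothetical.

```python
import pandas as pd

# Hypothetical prices: state A has two Monday (weekday 0) observations
toy = pd.DataFrame({
    "State": ["A", "A", "A", "B"],
    "weekday": [0, 0, 1, 0],
    "HighQ": [300.0, 310.0, 320.0, 250.0],
})

# Default aggregation is the mean
table = pd.pivot_table(toy, values="HighQ", index="State", columns="weekday")

# Same layout, but keep the maximum per cell
table_max = pd.pivot_table(toy, values="HighQ", index="State",
                           columns="weekday", aggfunc="max")

print(table.loc["A", 0])        # 305.0
print(table_max.loc["A", 0])    # 310.0
print(table.isna().loc["B", 1]) # True: B has no weekday-1 rows
```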


**Exercise** Get a cross tabulation: for each state, for each month, get the prices for the weekday as shown in the output 

## 2.5 Derive the Data

Let us load the demographic dataset and create a new column for the rest of the population, i.e. everyone not counted in the four listed groups

In [35]:
df_demo = pd.read_csv("https://raw.githubusercontent.com/reddyprasade/Data-Analysis-with-Python/main/Statistics/Data/Demographics_State.csv")

In [36]:
df_demo.head()

Unnamed: 0,region,total_population,percent_white,percent_black,percent_asian,percent_hispanic,per_capita_income,median_rent,median_age
0,alabama,4799277,67,26,1,4,23680,501,38.1
1,alaska,720316,63,3,5,6,32651,978,33.6
2,arizona,6479703,57,4,3,30,25358,747,36.3
3,arkansas,2933369,74,15,1,7,22170,480,37.5
4,california,37659181,40,6,13,38,29527,1119,35.4


In [37]:
df_demo["percent_other"] = 100 - df_demo["percent_white"] - df_demo["percent_black"] - df_demo["percent_asian"] - df_demo["percent_hispanic"]

In [38]:
df_demo.head()

Unnamed: 0,region,total_population,percent_white,percent_black,percent_asian,percent_hispanic,per_capita_income,median_rent,median_age,percent_other
0,alabama,4799277,67,26,1,4,23680,501,38.1,2
1,alaska,720316,63,3,5,6,32651,978,33.6,23
2,arizona,6479703,57,4,3,30,25358,747,36.3,6
3,arkansas,2933369,74,15,1,7,22170,480,37.5,3
4,california,37659181,40,6,13,38,29527,1119,35.4,3


**Exercise** Express median rent as a proportion of California's median rent (compute it as a new column)

## 2.6 Merge the Data 

Let's merge the demographic dataset with the price dataset

In [39]:
# Let us change the column name region to State
df_demo = df_demo.rename(columns={'region':'State'})

In [40]:
df_demo.head()

Unnamed: 0,State,total_population,percent_white,percent_black,percent_asian,percent_hispanic,per_capita_income,median_rent,median_age,percent_other
0,alabama,4799277,67,26,1,4,23680,501,38.1,2
1,alaska,720316,63,3,5,6,32651,978,33.6,23
2,arizona,6479703,57,4,3,30,25358,747,36.3,6
3,arkansas,2933369,74,15,1,7,22170,480,37.5,3
4,california,37659181,40,6,13,38,29527,1119,35.4,3


In [41]:
# We can now merge Demographic and Price mean data into one single data frame
df_merge = pd.merge(df_mean, df_demo, how='inner', on='State')
df_merge.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,year,month,week,weekday,total_population,percent_white,percent_black,percent_asian,percent_hispanic,per_capita_income,median_rent,median_age,percent_other


What happened? Why is there no data in the dataframe?
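
One way to investigate an empty merge, sketched on two hypothetical two-row frames: an outer merge with `indicator=True` adds a `_merge` column that says which side each key came from. If no row is marked `both`, the keys never matched.

```python
import pandas as pd

# Hypothetical frames whose keys differ only in capitalisation
left = pd.DataFrame({"State": ["Alabama", "Alaska"], "HighQ": [339.56, 291.48]})
right = pd.DataFrame({"State": ["alabama", "alaska"], "median_rent": [501, 978]})

inner = pd.merge(left, right, how="inner", on="State")
check = pd.merge(left, right, how="outer", on="State", indicator=True)

print(len(inner))                          # 0
print((check["_merge"] == "both").sum())   # 0
print(len(check))                          # 4
```

String keys are compared exactly, so `"Alabama"` and `"alabama"` count as different values.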

In [42]:
# Change the State in df_mean to lower case
df_mean['State'] = df_mean['State'].str.lower()

In [43]:
df_mean.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,year,month,week,weekday
0,alabama,339.561849,1379.414254,204.606169,1270.351893,145.978508,161.14922,2014.167038,5.953229,23.812918,2.995546
1,alaska,291.482004,321.244989,262.046392,407.917595,394.653964,32.334076,2014.167038,5.953229,23.812918,2.995546
2,arizona,300.667483,2392.465479,209.365345,2137.414254,188.500134,279.006682,2014.167038,5.953229,23.812918,2.995546
3,arkansas,348.056147,751.988864,190.414655,724.683742,126.771269,135.902004,2014.167038,5.953229,23.812918,2.995546
4,california,245.376125,14947.073497,191.268909,16769.821826,189.783586,976.298441,2014.167038,5.953229,23.812918,2.995546


In [44]:
# We can now merge Demographic and Price mean data into one single data frame
df_merge = pd.merge(df_mean, df_demo, how='inner', on='State')

In [45]:
df_merge.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,year,month,week,weekday,total_population,percent_white,percent_black,percent_asian,percent_hispanic,per_capita_income,median_rent,median_age,percent_other
0,alabama,339.561849,1379.414254,204.606169,1270.351893,145.978508,161.14922,2014.167038,5.953229,23.812918,2.995546,4799277,67,26,1,4,23680,501,38.1,2
1,alaska,291.482004,321.244989,262.046392,407.917595,394.653964,32.334076,2014.167038,5.953229,23.812918,2.995546,720316,63,3,5,6,32651,978,33.6,23
2,arizona,300.667483,2392.465479,209.365345,2137.414254,188.500134,279.006682,2014.167038,5.953229,23.812918,2.995546,6479703,57,4,3,30,25358,747,36.3,6
3,arkansas,348.056147,751.988864,190.414655,724.683742,126.771269,135.902004,2014.167038,5.953229,23.812918,2.995546,2933369,74,15,1,7,22170,480,37.5,3
4,california,245.376125,14947.073497,191.268909,16769.821826,189.783586,976.298441,2014.167038,5.953229,23.812918,2.995546,37659181,40,6,13,38,29527,1119,35.4,3


## 2.7 Filter the Data

Let's start by filtering the data
- by location
- by year
- by location & year

In [46]:
# Filter data for location California
df_cal = df[df["State"] == "California"]

In [47]:
df_cal.shape

(449, 12)

In [48]:
# Filter data for year
df_2014 = df[df["year"] == 2014]

In [49]:
df_2014.shape

(18564, 12)

In [50]:
df_cal_2014 = df[(df["year"] == 2014) & (df["State"] == "California")]

In [51]:
df_cal_2014.shape

(364, 12)

In [52]:
df_cal_2014.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date,year,month,week,weekday
4,California,248.78,12096,193.56,12812,192.92,778,2014-01-01,2014,1,1,2
769,California,248.67,12125,193.56,12836,192.8,779,2014-01-02,2014,1,1,3
1483,California,248.67,12141,193.57,12853,192.67,782,2014-01-03,2014,1,1,4
2248,California,248.65,12155,193.59,12884,192.67,782,2014-01-04,2014,1,1,5
3013,California,248.68,12176,193.63,12902,192.67,782,2014-01-05,2014,1,1,6
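
Two related idioms, sketched on a hypothetical frame: `isin` filters on several values at once, and `query` expresses the same condition as a string. Each comparison in a combined mask needs its own parentheses, because `&` binds more tightly than `==` in Python.

```python
import pandas as pd

# Hypothetical rows
toy = pd.DataFrame({
    "State": ["California", "California", "Texas", "Nevada"],
    "year": [2014, 2015, 2014, 2014],
})

# Boolean mask: year 2014 and state in a list of states
mask = (toy["year"] == 2014) & (toy["State"].isin(["California", "Nevada"]))
by_mask = toy[mask]

# The same filter written with query()
by_query = toy.query("year == 2014 and State in ['California', 'Nevada']")

print(by_mask.index.tolist())     # [0, 3]
print(by_query.equals(by_mask))   # True
```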


**Exercise** Find the percentage of Hispanic population for the state with the highest percentage of white population (use `df_demo`)

## 2.8 Summarise the Data

We can use the `describe()` method to get summary statistics for each numeric column in the data frame

In [53]:
df.describe()

Unnamed: 0,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,year,month,week,weekday
count,22899.0,22899.0,22899.0,22899.0,22899.0,22899.0,22899.0,22899.0,22899.0,22899.0
mean,329.759854,2274.743657,247.618306,2183.737805,203.624092,202.804489,2014.167038,5.953229,23.812918,2.995546
std,41.173167,2641.936586,44.276015,2789.902626,101.484265,220.531987,0.401765,3.553055,15.426018,2.005599
min,202.02,93.0,144.85,134.0,63.7,11.0,2013.0,1.0,1.0,0.0
25%,303.78,597.0,215.775,548.0,145.81,51.0,2014.0,3.0,9.0,1.0
50%,342.31,1420.0,245.8,1320.0,185.78,139.0,2014.0,6.0,22.0,3.0
75%,356.55,2958.0,274.155,2673.0,222.94,263.0,2014.0,9.0,37.0,5.0
max,415.7,18492.0,379.0,22027.0,734.65,1287.0,2015.0,12.0,52.0,6.0


We can also use convenience methods like `sum()`, `count()`, `mean()` etc. to calculate these
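
Several of these can also be requested in one call with `agg`. A minimal sketch on a hypothetical three-value series:

```python
import pandas as pd

# Hypothetical prices
toy = pd.Series([300.0, 310.0, 350.0], name="HighQ")

# One result per requested statistic, indexed by its name
stats = toy.agg(["sum", "count", "mean", "median"])

print(stats["sum"])      # 960.0
print(stats["count"])    # 3.0
print(stats["mean"])     # 320.0
print(stats["median"])   # 310.0
```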

In [54]:
df.HighQ.mean()

329.75985414210226

In [55]:
# Let's do this the hard way
df.HighQ.sum()

7551170.899999999

In [56]:
df.HighQ.count()

22899

In [57]:
df.HighQ.sum()/df.HighQ.count()

329.75985414210226

In [58]:
df.HighQ.median()

342.31

In [59]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22899 entries, 20094 to 8159
Data columns (total 12 columns):
 #   Column   Non-Null Count  Dtype         
---  ------   --------------  -----         
 0   State    22899 non-null  object        
 1   HighQ    22899 non-null  float64       
 2   HighQN   22899 non-null  int64         
 3   MedQ     22899 non-null  float64       
 4   MedQN    22899 non-null  int64         
 5   LowQ     22899 non-null  float64       
 6   LowQN    22899 non-null  int64         
 7   date     22899 non-null  datetime64[ns]
 8   year     22899 non-null  int64         
 9   month    22899 non-null  int64         
 10  week     22899 non-null  int64         
 11  weekday  22899 non-null  int64         
dtypes: datetime64[ns](1), float64(3), int64(7), object(1)
memory usage: 2.3+ MB


In [60]:
df.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date,year,month,week,weekday
20094,Alabama,339.65,1033,198.04,926,147.15,122,2013-12-27,2013,12,52,4
20859,Alabama,339.65,1033,198.04,926,147.15,122,2013-12-28,2013,12,52,5
21573,Alabama,339.75,1036,198.26,929,149.49,123,2013-12-29,2013,12,52,6
22287,Alabama,339.75,1036,198.81,930,149.49,123,2013-12-30,2013,12,1,0
22797,Alabama,339.42,1040,198.68,932,149.49,123,2013-12-31,2013,12,1,1


In [61]:
df.tail()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date,year,month,week,weekday
4997,Wyoming,313.72,148,317.38,226,161.3,13,2015-06-07,2015,6,23,6
5762,Wyoming,313.72,148,317.38,226,161.3,13,2015-06-08,2015,6,24,0
6527,Wyoming,313.72,148,317.38,226,161.3,13,2015-06-09,2015,6,24,1
7343,Wyoming,313.72,148,317.38,226,161.3,13,2015-06-10,2015,6,24,2
8159,Wyoming,313.72,148,317.38,226,161.3,13,2015-06-11,2015,6,24,3


## 2.9 Sample the Data

In [62]:
?df.sample

In [63]:
df_ca_sample = df[df.State=='California'].sample(n = 50, replace = True, random_state=123)

In [64]:
df_ca_sample.duplicated()

20914    False
10051    False
11428    False
2503     False
11275    False
9235     False
14590    False
8725     False
21118    False
15967    False
21322    False
7399     False
1024     False
13927    False
1075     False
8572     False
7042     False
20200    False
17599    False
6583     False
12499    False
6991     False
1432     False
20047    False
5104     False
10357    False
1024      True
16987    False
3217     False
21475    False
2911     False
21577    False
2197     False
2350     False
11428     True
15304    False
8572      True
5716     False
15508    False
7960     False
17956    False
3268     False
8521     False
8419     False
18364    False
15814    False
12244    False
20608    False
17497    False
9949     False
dtype: bool

In [65]:
df_ca_sample.loc[8572]

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date,year,month,week,weekday
8572,California,248.26,12753,192.89,13618,191.0,818,2014-02-12,2014,2,7,2
8572,California,248.26,12753,192.89,13618,191.0,818,2014-02-12,2014,2,7,2
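
The repeated rows above come from `replace=True`, which lets the same row be drawn more than once. A short sketch on a hypothetical five-row frame contrasts the two modes:

```python
import pandas as pd

# Hypothetical frame with five rows
toy = pd.DataFrame({"x": range(5)})

# With replacement: 10 draws from 5 rows must repeat some rows
with_repl = toy.sample(n=10, replace=True, random_state=0)

# Without replacement: every row appears at most once
without = toy.sample(n=5, replace=False, random_state=0)

print(with_repl.index.duplicated().any())   # True
print(without.index.is_unique)              # True
```

Fixing `random_state` makes the draw repeatable between runs.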


## 2.10 Quirks in Pandas

In [66]:
df_ca_sample.iat[0, 0] = "Cal"

In [67]:
df_ca_sample.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date,year,month,week,weekday
20914,Cal,244.05,16474,189.42,19085,188.6,1093,2014-12-28,2014,12,52,6
10051,California,243.6,16664,189.19,19416,188.6,1109,2015-01-14,2015,1,3,2
11428,California,244.75,15893,190.43,18113,188.6,1042,2014-11-15,2014,11,46,5
2503,California,246.53,13435,192.83,14590,191.62,866,2014-04-04,2014,4,14,4
11275,California,245.27,15108,191.6,16881,187.94,989,2014-08-15,2014,8,33,4


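But the change also shows up under a second name if you "copy" with plain assignment, because both names then refer to the same dataframe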

In [68]:
df_ca_sample2 = df_ca_sample

In [69]:
df_ca_sample2.iat[0, 0] = "CA"
df_ca_sample2.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date,year,month,week,weekday
20914,CA,244.05,16474,189.42,19085,188.6,1093,2014-12-28,2014,12,52,6
10051,California,243.6,16664,189.19,19416,188.6,1109,2015-01-14,2015,1,3,2
11428,California,244.75,15893,190.43,18113,188.6,1042,2014-11-15,2014,11,46,5
2503,California,246.53,13435,192.83,14590,191.62,866,2014-04-04,2014,4,14,4
11275,California,245.27,15108,191.6,16881,187.94,989,2014-08-15,2014,8,33,4


In [70]:
df_ca_sample.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date,year,month,week,weekday
20914,CA,244.05,16474,189.42,19085,188.6,1093,2014-12-28,2014,12,52,6
10051,California,243.6,16664,189.19,19416,188.6,1109,2015-01-14,2015,1,3,2
11428,California,244.75,15893,190.43,18113,188.6,1042,2014-11-15,2014,11,46,5
2503,California,246.53,13435,192.83,14590,191.62,866,2014-04-04,2014,4,14,4
11275,California,245.27,15108,191.6,16881,187.94,989,2014-08-15,2014,8,33,4


Fix the issue by making an explicit copy with `.copy()`

In [73]:
df_ca_sample3 = df_ca_sample2.copy()

In [74]:
df_ca_sample3.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date,year,month,week,weekday
20914,CA,244.05,16474,189.42,19085,188.6,1093,2014-12-28,2014,12,52,6
10051,California,243.6,16664,189.19,19416,188.6,1109,2015-01-14,2015,1,3,2
11428,California,244.75,15893,190.43,18113,188.6,1042,2014-11-15,2014,11,46,5
2503,California,246.53,13435,192.83,14590,191.62,866,2014-04-04,2014,4,14,4
11275,California,245.27,15108,191.6,16881,187.94,989,2014-08-15,2014,8,33,4


In [75]:
df_ca_sample3.iat[0, 0] = "CALIFORNIA"
df_ca_sample3.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date,year,month,week,weekday
20914,CALIFORNIA,244.05,16474,189.42,19085,188.6,1093,2014-12-28,2014,12,52,6
10051,California,243.6,16664,189.19,19416,188.6,1109,2015-01-14,2015,1,3,2
11428,California,244.75,15893,190.43,18113,188.6,1042,2014-11-15,2014,11,46,5
2503,California,246.53,13435,192.83,14590,191.62,866,2014-04-04,2014,4,14,4
11275,California,245.27,15108,191.6,16881,187.94,989,2014-08-15,2014,8,33,4


In [76]:
df_ca_sample2.head()

Unnamed: 0,State,HighQ,HighQN,MedQ,MedQN,LowQ,LowQN,date,year,month,week,weekday
20914,CA,244.05,16474,189.42,19085,188.6,1093,2014-12-28,2014,12,52,6
10051,California,243.6,16664,189.19,19416,188.6,1109,2015-01-14,2015,1,3,2
11428,California,244.75,15893,190.43,18113,188.6,1042,2014-11-15,2014,11,46,5
2503,California,246.53,13435,192.83,14590,191.62,866,2014-04-04,2014,4,14,4
11275,California,245.27,15108,191.6,16881,187.94,989,2014-08-15,2014,8,33,4
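
The behaviour in this section can be condensed into a few lines. This is a sketch on a hypothetical two-row frame: plain assignment binds a second name to the same object, while `.copy()` creates an independent one.

```python
import pandas as pd

a = pd.DataFrame({"State": ["California", "California"]})

b = a           # second name for the same dataframe
c = a.copy()    # independent copy

b.iat[0, 0] = "CA"           # also visible through a
c.iat[0, 0] = "CALIFORNIA"   # affects only c

print(b is a)         # True
print(c is a)         # False
print(a.iat[0, 0])    # CA
print(c.iat[0, 0])    # CALIFORNIA
```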

