# Views, Copies, and that annoying SettingWithCopyWarning
If you've spent any time in pandas at all, you've seen ```SettingWithCopyWarning```. If not, you will soon.

Like any warning, it's best not to ignore it: it's a sign that your code may not be doing what you expect. In my case, I usually get this warning when I'm knee-deep in some analysis and don't want to spend too much time figuring out how to fix it.

I'm going to cover a few typical examples of when this warning shows up, why it shows up, and how to quickly fix the underlying issue.

First, let's make an example ```DataFrame```.  I'm using a handy Python package called [Faker](https://faker.readthedocs.io/en/stable/index.html) to create some test data. You may need to install it first, with ```pip```.

As a quick aside, Faker is a great way to build test data for unit tests, test databases, or examples.

In [1]:
%pip install Faker

Note: you may need to restart the kernel to use updated packages.


In [2]:
import datetime

import pandas as pd
import numpy as np

from faker import Faker
fake = Faker()

In [3]:
# build 20 rows of fake people, one list of field values per row
df = pd.DataFrame(
    [
        [
            fake.first_name(),
            fake.last_name(),
            fake.date_of_birth(),
            fake.date_this_year(),
            fake.city(),
            fake.state_abbr(),
            fake.postalcode(),
        ]
        for _ in range(20)
    ],
    columns=['first_name', 'last_name', 'dob', 'lastupdate', 'city', 'state', 'zip'],
)

df.head(3)

Unnamed: 0,first_name,last_name,dob,lastupdate,city,state,zip
0,Matthew,Franco,1969-11-24,2021-02-03,South Amandafort,CA,26721
1,Beth,Miller,1909-12-04,2021-01-12,Petersonstad,AK,52881
2,Kevin,Kim,1952-01-26,2021-01-27,Yatestown,ME,81028


## How do we set data again?
First, let's just review the ways we can set data in a ```DataFrame``` using the ```loc``` or ```iloc``` indexers. These are for label-based and integer-offset-based indexing, respectively. (See [this article](https://www.wrighters.io/indexing-and-selecting-in-pandas-part-1/) for more detail on the two methods.)

The first argument in the indexer is for the row, the second is for the column (or columns), and if we assign to this expression, we will update the underlying ```DataFrame```.

Note that the index here is just a ```RangeIndex```, so the labels are numbers. Because of that, even though I'm passing int values to ```loc```, it is looking them up by label, not by position.

In [4]:
df.head(1)['zip']

0    26721
Name: zip, dtype: object

In [5]:
df.loc[0, 'zip'] = '60601'

In [6]:
df.head(1)['zip']

0    60601
Name: zip, dtype: object

In [7]:
df.loc[0, ['city', 'state']] = ['Chicago', 'IL']

In [8]:
df.head(1)

Unnamed: 0,first_name,last_name,dob,lastupdate,city,state,zip
0,Matthew,Franco,1969-11-24,2021-02-03,Chicago,IL,60601


Here's an example of an ```iloc``` update.

In [9]:
df.iloc[0, 0] = 'Josh'
df.head(1)

Unnamed: 0,first_name,last_name,dob,lastupdate,city,state,zip
0,Josh,Franco,1969-11-24,2021-02-03,Chicago,IL,60601
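The label-versus-position distinction is easier to see when the index labels don't happen to match the row positions. Here is a minimal sketch using a small made-up ```DataFrame``` (the name ```people``` and its values are just for illustration, not part of the data above):

```python
import pandas as pd

# Hypothetical data: integer labels that do not match row positions.
people = pd.DataFrame(
    {"first_name": ["Ann", "Bob", "Cy"]},
    index=[10, 20, 30],
)

by_label = people.loc[10, "first_name"]   # label 10 is the first row
by_position = people.iloc[0, 0]           # position 0 is also the first row

# Label 0 does not exist in this index, so loc raises a KeyError
# even though position 0 is perfectly valid for iloc.
try:
    people.loc[0, "first_name"]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, label_zero_found)
```

With a default ```RangeIndex``` the two indexers appear interchangeable for rows, which is why it's easy to forget that ```loc``` is always working with labels.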


Now, you can also do updates with the array indexing operator, but this can look very confusing because on a ```DataFrame``` you select the column first and then the row. I'd recommend not doing this for that reason alone, but as you'll soon see, there are other issues that can arise.

In [10]:
df["first_name"][0] = 'Joshy'

In [11]:
df.head(1)

Unnamed: 0,first_name,last_name,dob,lastupdate,city,state,zip
0,Joshy,Franco,1969-11-24,2021-02-03,Chicago,IL,60601


## When do we see this warning?
OK, now that we have updated our ```DataFrame``` successfully, it's time to see an example of where things can go wrong. For me, it's very typical to select a subset of the original data to work with. For example, let's say that we decide to only work with data where the person was born before 2000.

In [12]:
dob_limit = datetime.date(2000, 1, 1)
sub = df[df['dob'] < dob_limit]
sub.shape

(14, 7)

In [13]:
idx = sub.head(1).index[0]  # save the location for update attempts below
sub.head(1)

Unnamed: 0,first_name,last_name,dob,lastupdate,city,state,zip
0,Joshy,Franco,1969-11-24,2021-02-03,Chicago,IL,60601


Let's try to update the ```lastupdate``` column.

In [14]:
sub.loc[idx, 'lastupdate'] = datetime.date.today()

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  sub.loc[idx, 'lastupdate'] = datetime.date.today()


Boom! There it is, we are told we are trying to set values on a copy of a slice from a ```DataFrame```. What ended up happening here?  Well, ```sub``` *was* updated, but ```df``` *wasn't*.

In [15]:
sub.loc[idx, 'lastupdate']

datetime.date(2021, 2, 4)

In [16]:
df.loc[idx, 'lastupdate']

datetime.date(2021, 2, 3)

Pandas is warning you that you might not have done what you expected. When you created ```sub```, you ended up with a copy of the data in ```df```. When you updated the value, you were warned that you only updated the copy, not the original.
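To see the same behavior in isolation, here is a small sketch with made-up data (```orig``` and ```subset``` are hypothetical names standing in for ```df``` and ```sub```). The warning is silenced only to keep the demo output clean; whether it is emitted at all depends on your pandas version.

```python
import warnings

import pandas as pd

# Hypothetical data standing in for the article's df.
orig = pd.DataFrame({"year": [1969, 2005, 1952], "flag": ["a", "b", "c"]})

# Boolean filtering builds a new object holding its own data.
subset = orig[orig["year"] < 2000]

with warnings.catch_warnings():
    # Some pandas versions emit SettingWithCopyWarning here; ignore it for the demo.
    warnings.simplefilter("ignore")
    subset.loc[0, "flag"] = "changed"

print(subset.loc[0, "flag"])  # changed
print(orig.loc[0, "flag"])    # a
```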

## So how should this be fixed?
There are two primary ways to address this, and which one you choose depends on what you are trying to accomplish in your code. 

### Update the original
If your intention is to update your original data, you just need to update it directly. So instead of doing your update on ```sub```, do it on ```df``` instead.

In [17]:
df.loc[idx, 'lastupdate'] = datetime.date.today()

In [18]:
df.loc[idx, 'lastupdate']

datetime.date(2021, 2, 4)

Note that because ```sub``` is a copy, it won't pick up this update. If you want both ```sub``` and ```df``` to match, you need to either update both or recreate ```sub``` after the update. So pause and think any time you update a ```DataFrame```: have you created subsets of this data that now need to be refreshed?
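A rough sketch of the "recreate it after the update" option, again with made-up data and hypothetical names (```orig```, ```subset```, ```mask```):

```python
import pandas as pd

orig = pd.DataFrame({"year": [1969, 2005, 1952], "flag": ["a", "b", "c"]})
mask = orig["year"] < 2000
subset = orig[mask]

# Update the original directly...
orig.loc[0, "flag"] = "changed"
stale_value = subset.loc[0, "flag"]   # the old subset still holds the old value

# ...then rebuild the subset so it reflects the change.
subset = orig[mask]
fresh_value = subset.loc[0, "flag"]

print(stale_value, fresh_value)
```

Keeping the filter in a variable like ```mask``` makes the rebuild a one-liner, though you would need to recompute the mask itself if the update changed the column it is based on.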

### Update the copy
If your goal is to update only the copy of the data, you can eliminate the warning by making the copy explicit with ```copy()```.

In [19]:
sub2 = df[df['dob'] < dob_limit].copy()
sub2.loc[idx, 'lastupdate'] = datetime.date.today()
sub2.loc[idx, 'lastupdate']

datetime.date(2021, 2, 4)
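If you'd like to convince yourself that the explicit copy really is quiet, one way is to record warnings while doing the update. This is only a sketch with made-up data; the check matches on the warning's class name so it doesn't depend on where a given pandas version exposes that class.

```python
import warnings

import pandas as pd

orig = pd.DataFrame({"year": [1969, 2005, 1952], "flag": ["a", "b", "c"]})

# An explicit copy: pandas treats this as an independent DataFrame.
subset = orig[orig["year"] < 2000].copy()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    subset.loc[0, "flag"] = "changed"

copy_warnings = [w for w in caught if "SettingWithCopy" in w.category.__name__]
print(len(copy_warnings))      # 0
print(orig.loc[0, "flag"])     # a, the original is untouched
```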

### In between
One common situation is that an initial full-sized ```DataFrame``` is narrowed down to a much smaller one by filtering the data. Maybe new columns are added as part of some calculations, and then, as a final result, the original ```DataFrame``` should be updated. One way to do that is to use the index to help you out.

In [20]:
sub3 = df[df['dob'] < dob_limit].copy()                                          # we'll be updating this DataFrame
sub3['manualupdate'] = datetime.date.today() - datetime.timedelta(days=10)       # you can modify this DataFrame
sub3 = sub3.head(3)                                                              # or even make it smaller
sub3['manualupdate']

0    2021-01-25
1    2021-01-25
2    2021-01-25
Name: manualupdate, dtype: object

Now, we'll use the fact that ```sub3``` shares an index with the original ```df``` to write the data back. For example, we can update the ```lastupdate``` column for all matching rows.

In [21]:
df.loc[sub3.index, 'lastupdate'] = sub3['manualupdate']
df.loc[sub3.index]

Unnamed: 0,first_name,last_name,dob,lastupdate,city,state,zip
0,Joshy,Franco,1969-11-24,2021-01-25,Chicago,IL,60601
1,Beth,Miller,1909-12-04,2021-01-25,Petersonstad,AK,52881
2,Kevin,Kim,1952-01-26,2021-01-25,Yatestown,ME,81028


Now, you can see that those rows have been updated from our smaller subset of data. 
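The same write-back pattern, boiled down to a tiny made-up example (the names ```orig```, ```work```, and ```new_score``` are hypothetical). The index labels are deliberately not 0..n-1 to show that the match happens on labels:

```python
import pandas as pd

orig = pd.DataFrame(
    {"name": ["a", "b", "c", "d"], "score": [1, 2, 3, 4]},
    index=[100, 101, 102, 103],
)

# Work on an explicit copy of a filtered subset.
work = orig[orig["score"] > 2].copy()
work["new_score"] = work["score"] * 10

# Write back only the rows present in the subset, matched by index label.
orig.loc[work.index, "score"] = work["new_score"]

print(orig["score"].tolist())  # [1, 2, 30, 40]
```

This only works cleanly if the index labels are unique and you haven't reset the index on the subset along the way.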

## Subsets of columns
You also may encounter this warning when working with subsets of columns in a ```DataFrame```.

In [22]:
df_d = df[['zip']]
df_d.loc[idx, 'zip'] = "00313" # SettingWithCopyWarning 

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  iloc._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


A great way to avoid the warning here is to do a full row slice with ```loc``` in your initial selection. You can also call ```copy``` on the selection.

In [23]:
df_d = df.loc[:, ['zip']]

df_d.loc[idx, 'zip'] = "00313"
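Either way, keep in mind that the column subset is its own object, so the update doesn't reach the original. A small sketch with made-up data (all names here are hypothetical); warnings are ignored because whether any are emitted depends on the pandas version:

```python
import warnings

import pandas as pd

orig = pd.DataFrame({"city": ["Chicago", "Boston"], "zip": ["60601", "02108"]})

via_loc = orig.loc[:, ["zip"]]     # full row slice, list of columns
via_copy = orig[["zip"]].copy()    # explicit copy of the column subset

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    via_loc.loc[0, "zip"] = "00313"
    via_copy.loc[0, "zip"] = "00313"

print(via_loc.loc[0, "zip"], via_copy.loc[0, "zip"])  # 00313 00313
print(orig.loc[0, "zip"])                             # 60601
```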

### For completeness, some more details
Now you can read about this warning in many [other](https://www.dataquest.io/blog/settingwithcopywarning/) [places](https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas), and if you've come here through a search engine, maybe you've already found them either confusing or not directly applicable to your situation. I took a slightly different approach above to show the situation where I usually see this warning. However, a more common reason new pandas users encounter it is when trying to update their ```DataFrame``` using the array index operator (```[]```).

In [24]:
df[df['dob'] < dob_limit]['lastupdate'] = datetime.date.today()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df[df['dob'] < dob_limit]['lastupdate'] = datetime.date.today()


The fix here is pretty straightforward: use ```loc```. Let's give that a try.

In [25]:
df.loc[df['dob'] < dob_limit, 'lastupdate'] = datetime.date.today() - datetime.timedelta(days=1)
df.loc[df['dob'] < dob_limit].head(1)

Unnamed: 0,first_name,last_name,dob,lastupdate,city,state,zip
0,Joshy,Franco,1969-11-24,2021-02-03,Chicago,IL,60601


That works. The warning here was telling us that our first update is (potentially) operating on a copy of our original data. I don't think this is quite as obvious as our opening case because pandas has some complicated reasons for choosing to sometimes return a copy and sometimes return a view into the original data, and this may not seem obvious when the update is on one line. When it can detect that this is happening, it raises this warning.

This is called chained assignment. The assignment above with the warning is really doing this:

In [26]:
df.__getitem__(df.__getitem__('dob') < dob_limit).__setitem__('lastupdate', datetime.date.today())

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df.__getitem__(df.__getitem__('dob') < dob_limit).__setitem__('lastupdate', datetime.date.today())


When you use the array index operator, the ```__getitem__``` and ```__setitem__``` methods are invoked for getting and setting respectively. The outer call to ```__getitem__``` returns a copy of the data, and ```__setitem__``` then sets data on that copy, triggering the warning.


If we use ```loc```, though, it will be doing this instead: a single ```__setitem__``` call on the original, with no temporary object in between.

In [27]:
df.loc.__setitem__((df.__getitem__('dob') < dob_limit, 'lastupdate'), datetime.date.today())
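Here is a side-by-side sketch of the two forms on made-up data (```orig``` and ```mask``` are hypothetical names). The chained version is wrapped so that whatever warning your pandas version emits doesn't clutter the output:

```python
import warnings

import pandas as pd

orig = pd.DataFrame({"year": [1969, 2005, 1952], "flag": ["a", "b", "c"]})
mask = orig["year"] < 2000

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # Chained: the assignment lands on the temporary object returned by orig[mask].
    orig[mask]["flag"] = "x"

after_chained = orig["flag"].tolist()

# Single call: loc receives the rows and the column together and writes into orig.
orig.loc[mask, "flag"] = "x"
after_loc = orig["flag"].tolist()

print(after_chained)  # ['a', 'b', 'c']
print(after_loc)      # ['x', 'b', 'x']
```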

So whenever you see this warning, just look at your code and check two things. Did you try to update the data using ```[]```? If so, switch to ```loc``` (or ```iloc```). If you're already doing that and it's still complaining, it's because your ```DataFrame``` was created from another ```DataFrame```. Either make a full copy if you plan to update it, or update your original ```DataFrame``` instead.