# PyData Cardiff Workshop 3 - Introduction to Pandas

![title](images/pydata_cardiff.jpg)

## Introduction to the library

Pandas is a seminal Python library that has revolutionised data analytics in the language. Wes McKinney began developing it in 2008 while working at AQR Capital Management. Initially, it was a purely in-house project, but on leaving his position, Wes was able to convince AQR to permit him to open-source the code.

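If anyone is interested - the name Pandas is derived from "panel data", an econometrics term for multidimensional structured datasets, and is also a play on "Python data analysis"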

Note that the usual way to import this library is to use the pattern `import pandas as pd`

In [1]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:90% !important; }</style>"))

In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

## Introducing the basic data types

### The Pandas Series

A one-dimensional array of information. It has similarities with a numpy array - and it can be useful to think of a Series as a column of information in an Excel spreadsheet. As with the numpy array - all of the values in a Series _should_ be of the same data type.

#### Creating a simple Series - very similar to a `numpy array`

Note the presence of a single integer at the end - this will be coerced to a float.

In [3]:
ar = np.array([0.2, 1.2, 3.4, 5.6, 3.8, 6.7, 1.2, 7])
ser = pd.Series([0.2, 1.2, 3.4, 5.6, 3.8, 6.7, 1.2, 7])

In [4]:
ar.dtype

dtype('float64')

In [5]:
ser.dtype

dtype('float64')

## Note how the series deals with Mixed types

It states that they are of type `'O'` - meaning a Python object!

In [6]:
object_ser = pd.Series([1, 'hello', None, 3.4])

In [7]:
object_ser

0        1
1    hello
2     None
3      3.4
dtype: object

In [8]:
object_ser.dtype

dtype('O')

### Similar methods and functionality

A Series shares many methods with the numpy array, such as `mean` and `sum`. The numpy 'universal' functions (`ufunc`), such as `np.sqrt`, can also be applied directly to a Series.

In [9]:
ar.mean()

3.6375

In [10]:
ser.mean()

3.6375

In [11]:
ar.sum()

29.1

In [12]:
ser.sum()

29.1
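As a small sketch of the `ufunc` point above: applying a numpy universal function to a Series works element-wise and hands back a Series with the same index. The variable name `rooted` is just for illustration.

```python
import numpy as np
import pandas as pd

ser = pd.Series([0.2, 1.2, 3.4, 5.6, 3.8, 6.7, 1.2, 7])

# Applying a numpy ufunc element-wise returns a new Series, not a bare array
rooted = np.sqrt(ser)

print(type(rooted))                      # a pandas Series
print(rooted.index.equals(ser.index))    # the original index is kept
```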

### However!

There will be some different behaviours seen! Note the different ways in which the variance is calculated.

In numpy - this is calculated as:

$$\frac{\Sigma (x - \bar{x})^{2}}{n}$$

In [13]:
ar.var()

6.039843749999999

But in the Series - this is calculated as the _unbiased_ variance, using a method called _Bessel's Correction_ by subtracting 1 from _n_

$$\frac{\Sigma (x - \bar{x})^{2}}{n - 1}$$

The effect that this has is a slightly larger value for the variance. In statistics this matters because the data are usually a sample: dividing by _n_ tends to underestimate the variance of the population that the sample was drawn from, and dividing by _n - 1_ corrects for that bias.

This behaviour can be changed with the _delta degrees of freedom_ argument `ddof`

In [14]:
ser.var()

6.902678571428571

In [15]:
ser.var(ddof=1)

6.902678571428571

In [16]:
ser.var(ddof=0)

6.039843749999999
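To make the two formulas concrete, here is a minimal sketch that computes both versions by hand and compares them with the library defaults. The names `population_var` and `sample_var` are only for illustration.

```python
import numpy as np
import pandas as pd

values = [0.2, 1.2, 3.4, 5.6, 3.8, 6.7, 1.2, 7]
ar = np.array(values)
ser = pd.Series(values)

n = len(values)
squared_deviations = (ar - ar.mean()) ** 2

population_var = squared_deviations.sum() / n        # divide by n
sample_var = squared_deviations.sum() / (n - 1)      # divide by n - 1

print(np.isclose(population_var, ar.var()))    # numpy default, ddof=0
print(np.isclose(sample_var, ser.var()))       # pandas default, ddof=1
print(np.isclose(sample_var, ar.var(ddof=1)))  # numpy accepts ddof too
```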

## The Series Index

This is a key feature of the Series when compared with the array - and can be thought of as the name that a row would be given if the Series was a column in a spreadsheet.

This can be seen when we simply view the object - note that as we did not set this, the default value is the number of the row - indexed from 0

In [17]:
ar

array([0.2, 1.2, 3.4, 5.6, 3.8, 6.7, 1.2, 7. ])

In [18]:
ser

0    0.2
1    1.2
2    3.4
3    5.6
4    3.8
5    6.7
6    1.2
7    7.0
dtype: float64

In [19]:
ser.index

RangeIndex(start=0, stop=8, step=1)

This can be set at the creation of the variable - and note that we can use the values of the previous series

In [20]:
ser2 = pd.Series(data=ser.values, index = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'])

In [21]:
ser2

a    0.2
b    1.2
c    3.4
d    5.6
e    3.8
f    6.7
g    1.2
h    7.0
dtype: float64

We can always call the series with `.values` to get the information as a numpy array

In [22]:
isinstance(ser.values, np.ndarray)

True

In this way - the series can be interacted with in a similar fashion to a dictionary

In [23]:
di = {'a': 0.2, 'g': 1.2, 'h': 7., 'b': 1.2, 'c': 3.4, 'f': 6.7, 'd': 5.6, 'e': 3.8}

In [24]:
ser[2]

3.4

In [25]:
ser2['c']

3.4

In [26]:
di['c']

3.4

However - note that there is an additional slicing ability that is not present in dictionaries

__BUT__ - take care to notice that slicing by label in Pandas is __inclusive__ of the end point!!!!

In [27]:
ser2['c': 'f']

c    3.4
d    5.6
e    3.8
f    6.7
dtype: float64

In [28]:
di['c': 'f']

TypeError: unhashable type: 'slice'
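The inclusive end point only applies when slicing by label. As a quick sketch (using `.loc` and `.iloc`, which are introduced properly later in the workshop), the same Series sliced by position follows the usual Python rule of excluding the end point:

```python
import pandas as pd

ser2 = pd.Series([0.2, 1.2, 3.4, 5.6, 3.8, 6.7, 1.2, 7],
                 index=list('abcdefgh'))

by_label = ser2.loc['c':'f']     # labels c, d, e and f: end point included
by_position = ser2.iloc[2:5]     # positions 2, 3 and 4: end point excluded

print(list(by_label.index))
print(list(by_position.index))
```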

## Additional functionality in the Series

A good example of this are the functions `rolling` and `expanding`. These create a type of _Window_ function - either sliding or expanding.

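Note the presence of the missing values at the start of each result. In both calls below, the argument `3` means that at least 3 observations are needed to produce a value, so the first two rows are `NaN` and the first result appears in the third row.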

In [29]:
ser2.expanding(3).mean()

a         NaN
b         NaN
c    1.600000
d    2.600000
e    2.840000
f    3.483333
g    3.157143
h    3.637500
dtype: float64

In [30]:
ser2.rolling(3).mean()

a         NaN
b         NaN
c    1.600000
d    3.400000
e    4.266667
f    5.366667
g    3.900000
h    4.966667
dtype: float64
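If the leading `NaN` values are not wanted, the `min_periods` argument controls how many observations are needed before a result is produced. A short sketch, with illustrative variable names:

```python
import pandas as pd

ser2 = pd.Series([0.2, 1.2, 3.4, 5.6, 3.8, 6.7, 1.2, 7],
                 index=list('abcdefgh'))

# Window of 3 rows, but emit a value as soon as 1 observation is available
rolled = ser2.rolling(3, min_periods=1).mean()

# For expanding, the first argument *is* min_periods (it defaults to 1)
expanded = ser2.expanding(min_periods=1).mean()

print(rolled.isna().sum(), expanded.isna().sum())
```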

There is also functionality to shift the data by position

In [31]:
ser2.shift(1)

a    NaN
b    0.2
c    1.2
d    3.4
e    5.6
f    3.8
g    6.7
h    1.2
dtype: float64

In [32]:
ser2.shift(-3)

a    5.6
b    3.8
c    6.7
d    1.2
e    7.0
f    NaN
g    NaN
h    NaN
dtype: float64

## Missing values - differences between numpy and pandas

One feature of numpy arrays is that the presence of a missing value has a detrimental effect on aggregations such as `sum` or `mean` - the result becomes `nan`

Note that we __must__ use the `np.nan` (not a number) variable to create the missing value - `None` will not work, as it produces an array of dtype `object`

In [33]:
ar_missing = np.array([1, 2, 3, 4, np.nan, 5])

In [35]:
ar_missing

array([ 1.,  2.,  3.,  4., nan,  5.])

In [34]:
ar_missing.sum()

nan

In [36]:
ar_missing.mean()

nan

Slight difference here:

In [37]:
ar_missing.cumsum()

array([ 1.,  3.,  6., 10., nan, nan])

This has to be dealt with using the specialised functions

In [38]:
np.nansum(ar_missing)

15.0

In [39]:
np.nanmean(ar_missing)

3.0

In [40]:
np.nancumsum(ar_missing)

array([ 1.,  3.,  6., 10., 10., 15.])

In Pandas - this behaviour of skipping missing values __is the default!__

Also - note that we can create a missing value using `None` - it will get changed to a `NaN` automatically

In [41]:
ser_missing = pd.Series([1, 2, 3, 4, None, 5])

In [42]:
ser_missing

0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
5    5.0
dtype: float64

In [43]:
ser_missing.sum()

15.0

In [44]:
ser_missing.mean()

3.0

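This one is slightly different! The missing value is skipped in the running total, but its own position stays as `NaN` rather than repeating the previous total as `np.nancumsum` does.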

In [45]:
ser_missing.cumsum()

0     1.0
1     3.0
2     6.0
3    10.0
4     NaN
5    15.0
dtype: float64
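Should the numpy behaviour be wanted in Pandas, these methods accept a `skipna` argument. A minimal sketch, with `total` and `cumulative` as illustrative names:

```python
import numpy as np
import pandas as pd

ser_missing = pd.Series([1, 2, 3, 4, None, 5])

# skipna=False makes the missing value propagate, as in plain numpy
total = ser_missing.sum(skipna=False)
cumulative = ser_missing.cumsum(skipna=False)

print(total)
print(cumulative)
```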

## Dealing with missing values

Missing values are a common feature of real datasets. Four ways of dealing with them are shown here.

1. Replacing the missing value with a stated replacement
2. Backfilling the data from later/lower
3. Forward filling the data from earlier/higher
4. Just drop them entirely!

In [46]:
ser_missing_start = ser2.shift(3)

In [47]:
ser_missing_start

a    NaN
b    NaN
c    NaN
d    0.2
e    1.2
f    3.4
g    5.6
h    3.8
dtype: float64

In [48]:
ser_missing_start.fillna(-999)

a   -999.0
b   -999.0
c   -999.0
d      0.2
e      1.2
f      3.4
g      5.6
h      3.8
dtype: float64

Remember that missing numbers won't affect the mean calculation in Pandas

In [49]:
ser_missing_start.fillna(ser_missing_start.mean())

a    2.84
b    2.84
c    2.84
d    0.20
e    1.20
f    3.40
g    5.60
h    3.80
dtype: float64

In [50]:
ser_missing_start.bfill()

a    0.2
b    0.2
c    0.2
d    0.2
e    1.2
f    3.4
g    5.6
h    3.8
dtype: float64

In [51]:
ser_missing_end = ser2.shift(-3)

In [52]:
ser_missing_end

a    5.6
b    3.8
c    6.7
d    1.2
e    7.0
f    NaN
g    NaN
h    NaN
dtype: float64

In [53]:
ser_missing_end.ffill()

a    5.6
b    3.8
c    6.7
d    1.2
e    7.0
f    7.0
g    7.0
h    7.0
dtype: float64

In [54]:
ser_missing_end.dropna()

a    5.6
b    3.8
c    6.7
d    1.2
e    7.0
dtype: float64

# Moving to the DataFrame

This is really the main datatype in Pandas. Think of one as a collection of Series objects - all sharing the same index.

A dataframe can be created using a variety of methods - only a few of which will be shown here.

Using a dictionary. In order to maintain the desired column order - we will be using an `OrderedDict` here (since Python 3.7 a plain `dict` also preserves insertion order, so this is only needed on older versions)

In [55]:
from collections import OrderedDict

In [56]:
data1 = OrderedDict({
    'col1': [1, 2, 3, 4],
    'col2': [4, 5, 6, 7]
})

In [57]:
df1 = pd.DataFrame(data1)

In [58]:
df1

   col1  col2
0     1     4
1     2     5
2     3     6
3     4     7
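On the question of column order: a brief sketch, assuming Python 3.7 or later, showing that a plain `dict` keeps its insertion order and that the `columns` argument can be used to choose the order explicitly. The names `data_plain`, `df_plain` and `df_reordered` are illustrative.

```python
import pandas as pd

# Keys deliberately inserted as col2 first, then col1
data_plain = {'col2': [4, 5, 6, 7], 'col1': [1, 2, 3, 4]}

df_plain = pd.DataFrame(data_plain)
print(list(df_plain.columns))        # follows the dict's insertion order

# The columns argument selects and orders the dictionary keys
df_reordered = pd.DataFrame(data_plain, columns=['col1', 'col2'])
print(list(df_reordered.columns))
```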


Using a numpy array, with column information

In [59]:
data2 = np.array([
    [1, 4],
    [2, 5],
    [3, 6],
    [4, 7]
])

In [60]:
data2

array([[1, 4],
       [2, 5],
       [3, 6],
       [4, 7]])

In [61]:
df2 = pd.DataFrame(data2, columns=['col1', 'col2'])

In [62]:
df2

   col1  col2
0     1     4
1     2     5
2     3     6
3     4     7


In a similar fashion to a series - we can use `.values` to get the data as a numpy array

In [63]:
df2.values

array([[1, 4],
       [2, 5],
       [3, 6],
       [4, 7]])

The index can also be set at creation

In [64]:
df3 = pd.DataFrame(data1, index=['a', 'b', 'c', 'd'])

In [65]:
df3

   col1  col2
a     1     4
b     2     5
c     3     6
d     4     7


Note that the columns and index __must__ be of the correct length, or you will get an error!

In [66]:
# err = pd.DataFrame(data1, index=['a', 'b', 'c'])
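To see the error without stopping the notebook, the call can be wrapped in a `try` block. This is only a sketch; the exact wording of the message varies between Pandas versions, so it is printed rather than assumed.

```python
import pandas as pd

data1 = {'col1': [1, 2, 3, 4], 'col2': [4, 5, 6, 7]}

raised = False
try:
    # Four rows of data but only three index labels
    pd.DataFrame(data1, index=['a', 'b', 'c'])
except ValueError as error:
    raised = True
    print(error)
```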

## Adding and selecting data

If we wish to add a column of information to the dataframe, we can use the dictionary-like `[]` notation, as long as the value being assigned is of the correct length.

In [67]:
df3['col3'] = [4, 3, 2, 1]
df3['col4'] = [101, 102, 103, 104]
df3['col5'] = [-1, -2, -3, -4]

In [68]:
df3

   col1  col2  col3  col4  col5
a     1     4     4   101    -1
b     2     5     3   102    -2
c     3     6     2   103    -3
d     4     7     1   104    -4


We can also use the `[]` notation to obtain a single series back from the dataframe, using the column name

In [69]:
col2_series = df3['col2']

In [70]:
col2_series

a    4
b    5
c    6
d    7
Name: col2, dtype: int64

In [71]:
type(col2_series)

pandas.core.series.Series

## Using double brackets - `[[]]`

A very important feature to learn is that, while the `[]` notation returned a series, if we use double square brackets, then we do not get a series... but a __dataframe__

In [72]:
col2_df = df3[['col2']]

In [73]:
col2_df

   col2
a     4
b     5
c     6
d     7


In [74]:
type(col2_df)

pandas.core.frame.DataFrame

As dataframes do not need to be 1D - we can use this method to select multiple columns

In [75]:
df3[['col1', 'col3']]

   col1  col3
a     1     4
b     2     3
c     3     2
d     4     1


## Using conditional statements to select information

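* This can include either a single condition - or multiple conditions
* But note the syntax when combining conditions: each one is wrapped in parentheses and they are joined with `&` (and) or `|` (or), not the Python keywords `and` / `or`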

In [76]:
df3[df3['col1'] > 1]

   col1  col2  col3  col4  col5
b     2     5     3   102    -2
c     3     6     2   103    -3
d     4     7     1   104    -4


In [77]:
df3[(df3['col1'] > 1) & (df3['col2'] >= 6)]

   col1  col2  col3  col4  col5
c     3     6     2   103    -3
d     4     7     1   104    -4
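A short sketch of the other operators that can be used when combining conditions: `|` for "or" and `~` for "not". It also shows that the Python keyword `and` fails, because a whole Series has no single truth value. A cut-down version of `df3` is rebuilt here so that the example runs on its own.

```python
import pandas as pd

df3 = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [4, 5, 6, 7]},
                   index=list('abcd'))

either = df3[(df3['col1'] > 3) | (df3['col2'] < 5)]   # rows matching either condition
negated = df3[~(df3['col1'] > 1)]                     # rows NOT matching the condition

and_failed = False
try:
    df3[(df3['col1'] > 1) and (df3['col2'] >= 6)]
except ValueError:
    # The truth value of a whole Series is ambiguous
    and_failed = True

print(list(either.index), list(negated.index), and_failed)
```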


## Using the `.iloc` and `.loc` notation

This is often the preferred method of selecting data. It can seem a little strange - but this will hopefully break it down

* We use `loc` to select by the labels present in the index
* We use `iloc` to select by the numbers of the rows - indexed from 0
    * Of course - if the index is the default of row numbers - then this will be the same!
* The earlier feature of `[]` for a series and `[[]]` for a dataframe still holds!

Of note - you will sometimes see the indexer `ix` used in some older text - this was deprecated and then removed in Pandas 1.0

In [78]:
df3.loc['a']

col1      1
col2      4
col3      4
col4    101
col5     -1
Name: a, dtype: int64

In [79]:
df3.iloc[0]

col1      1
col2      4
col3      4
col4    101
col5     -1
Name: a, dtype: int64

In [80]:
df3.loc[['a']]

   col1  col2  col3  col4  col5
a     1     4     4   101    -1


In [81]:
df3.iloc[[0]]

   col1  col2  col3  col4  col5
a     1     4     4   101    -1
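The difference between `loc` and `iloc` is easiest to see when the index holds integers that do not match the row positions. A minimal sketch, using a made-up dataframe `df_int`:

```python
import pandas as pd

# Integer labels in reverse order, so label 0 belongs to the LAST row
df_int = pd.DataFrame({'col1': [10, 20, 30]}, index=[2, 1, 0])

by_label = df_int.loc[0, 'col1']     # row whose label is 0
by_position = df_int.iloc[0, 0]      # first row, first column

print(by_label, by_position)
```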


## Selecting a column as well!

Note that this will return the value that appears in a particular cell

In [82]:
df3.loc['a', 'col3']

4

## Slicing

Using this method - we can use slicing for both rows and columns

In [83]:
df3.loc['a': 'd', 'col2': 'col4']

   col2  col3  col4
a     4     4   101
b     5     3   102
c     6     2   103
d     7     1   104


# The `SettingWithCopyWarning`!

This will soon become the bane of your life when working with Pandas dataframes!

Here I will try to explain it as best as I can!

In [84]:
df4 = df3.copy()

In [85]:
df4

   col1  col2  col3  col4  col5
a     1     4     4   101    -1
b     2     5     3   102    -2
c     3     6     2   103    -3
d     4     7     1   104    -4


In [86]:
df4['a': 'c']['col2']

a    4
b    5
c    6
Name: col2, dtype: int64

In [87]:
df4['a': 'c']['col2'] = [32, 31, 30]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


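It has still worked though! But this is not guaranteed - whether chained indexing like this modifies the original depends on whether Pandas returned a view or a copy, and that can vary between versions.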

In [88]:
df4

   col1  col2  col3  col4  col5
a     1    32     4   101    -1
b     2    31     3   102    -2
c     3    30     2   103    -3
d     4     7     1   104    -4


In [89]:
df4 = df3.copy()

Using `loc`, we avoid this warning!

In [90]:
df4.loc['a': 'c', 'col2'] = [32, 31, 30]

In [91]:
df4

   col1  col2  col3  col4  col5
a     1    32     4   101    -1
b     2    31     3   102    -2
c     3    30     2   103    -3
d     4     7     1   104    -4


### Now - this seems to work here

In [92]:
df5 = df4.loc['a': 'c', ['col1', 'col3', 'col5']]

In [93]:
df5

   col1  col3  col5
a     1     4    -1
b     2     3    -2
c     3     2    -3


In [94]:
df5.loc['a', 'col3'] = 9999

In [95]:
df6 = df4.loc['b': 'd', :]

In [96]:
df6.loc['a', 'col3'] = 9999

### Just observe how irritating this is!

This really looks the same to me!

In [97]:
warning_data = {'one': np.arange(1, 11), 'two': np.arange(11, 21)}   

In [98]:
warning_df = pd.DataFrame(warning_data)

In [99]:
warning_df

   one  two
0    1   11
1    2   12
2    3   13
3    4   14
4    5   15
5    6   16
6    7   17
7    8   18
8    9   19
9   10   20


In [100]:
warning_df2 = warning_df.loc[3:5, :] 

In [101]:
warning_df2

   one  two
3    4   14
4    5   15
5    6   16


In [102]:
warning_df2.loc[4, 'one'] = 99 

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self._setitem_with_indexer(indexer, value)
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  """Entry point for launching an IPython kernel.


In [103]:
warning_df

   one  two
0    1   11
1    2   12
2    3   13
3    4   14
4   99   15
5    6   16
6    7   17
7    8   18
8    9   19
9   10   20


### Just make a copy!

In [104]:
no_warning_df = pd.DataFrame(warning_data)

In [105]:
no_warning_df2 = no_warning_df.loc[3:5, :].copy()

In [106]:
no_warning_df2.loc[4, 'one'] = 99 

In [107]:
no_warning_df2

   one  two
3    4   14
4   99   15
5    6   16


In [108]:
no_warning_df

   one  two
0    1   11
1    2   12
2    3   13
3    4   14
4    5   15
5    6   16
6    7   17
7    8   18
8    9   19
9   10   20


#### This is admittedly confusing! For a more detailed explanation - see [this blog](https://www.dataquest.io/blog/settingwithcopywarning/)

# Loading in data

This is probably the most important part of the workshop, as it will be one of the most common processes that you will __always__ do when carrying out data analysis. For this, we will look at loading in data from both a comma-separated-value file `.csv` and Excel files (other methods can include reading in streaming data - or information from relational databases). The format that you will probably be working with most is `.csv`. This is done using the following methods:

* `pd.read_csv()`
* `pd.read_excel()`
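    * Note that to use this - you must install an Excel engine (use `pip` or `conda`)
    * For `.xlsx` files, recent versions of Pandas use `openpyxl` (together with its dependencies) for both reading and writing - `xlrd` is only needed for the legacy `.xls` format
        * But we won't be writing any Excel files here!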

This can quickly get more complicated than it initially sounds - a quick look at the documentation for these functions shows that! This is because of all of the potential problems that have to be considered when _parsing_ data from an external source. We do not have time to cover all of these, but a few of the features will be explained.

## Loading data from a _clean_ `.csv` file

* Note that this dataset does not have any index information - so one will be made with the row numbers indexed from 0
* Also - the file does not _have_ to be separated by commas - if any other punctuation is used (like `;`), then this can be specified with the `delimiter` or `sep` argument (they do exactly the same thing - violation of the Zen of Python!)
    * If you know that your columns are separated by spaces - or any other form of whitespace - then use `delim_whitespace=True` in the function call (deprecated since Pandas 2.2 in favour of `sep='\s+'`)
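As a small sketch of the separator options, the snippet below parses two made-up pieces of text (held in memory with `io.StringIO` rather than read from the workshop's data files): one separated by `;` and one separated by arbitrary whitespace.

```python
import io
import pandas as pd

# Hypothetical file contents, used only for illustration
semicolon_text = "a;b\n1;2\n3;4\n"
whitespace_text = "a   b\n1   2\n3 \t 4\n"

df_semicolon = pd.read_csv(io.StringIO(semicolon_text), sep=';')

# A regular expression matching one or more whitespace characters
df_whitespace = pd.read_csv(io.StringIO(whitespace_text), sep=r'\s+')

print(df_semicolon.equals(df_whitespace))
```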

In [119]:
!head -n 5 data/iris.csv

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa


In [109]:
iris_csv = pd.read_csv('data/iris.csv')

# This is just the same as:
# iris_csv = pd.read_csv('data/iris.csv', delimiter=',')

### Examining the data with `.head()` and `.tail()`

Probably the most used function that you will ever learn in Pandas is `head()`, which allows us to see the first 5 rows of data by default - but this number can be changed.

`tail()` has similar functionality - but shows the end of the dataframe rather than the top

In [110]:
iris_csv.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa


In [111]:
iris_csv.head(10)

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
5           5.4          3.9           1.7          0.4  setosa
6           4.6          3.4           1.4          0.3  setosa
7           5.0          3.4           1.5          0.2  setosa
8           4.4          2.9           1.4          0.2  setosa
9           4.9          3.1           1.5          0.1  setosa


In [112]:
iris_csv.tail()

     sepal_length  sepal_width  petal_length  petal_width    species
145           6.7          3.0           5.2          2.3  virginica
146           6.3          2.5           5.0          1.9  virginica
147           6.5          3.0           5.2          2.0  virginica
148           6.2          3.4           5.4          2.3  virginica
149           5.9          3.0           5.1          1.8  virginica


## Reading in from Excel

Here - the syntax is very similar, but note that as the file in question has multiple sheets - we can specify the sheet name of interest

In [113]:
diamonds = pd.read_excel('data/iris_and_diamonds.xlsx', sheet_name='diamonds')

In [114]:
diamonds.head()

   carat      cut color clarity  depth  table  price     x     y     z
0   0.23    Ideal     E     SI2   61.5   55.0    326  3.95  3.98  2.43
1   0.21  Premium     E     SI1   59.8   61.0    326  3.89  3.84  2.31
2   0.23     Good     E     VS1   56.9   65.0    327  4.05  4.07  2.31
3   0.29  Premium     I     VS2   62.4   58.0    334  4.20  4.23  2.63
4   0.31     Good     J     SI2   63.3   58.0    335  4.34  4.35  2.75


## Loading in some _problematic_ data!

This file has 2 lines of junk information at the top of it - you will sometimes get it when downloading from certain sites - as they like to put it in there for identification purposes - and to make our work more interesting/unbelievably-irritating!

In [120]:
!head -n 5 data/iris_problem.csv

Here are 2 lines
of problem information!
sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa


In [121]:
iris_csv2 = pd.read_csv('data/iris_problem.csv')

ParserError: Error tokenizing data. C error: Expected 1 fields in line 3, saw 5


In [122]:
iris_csv2 = pd.read_csv('data/iris_problem.csv', skiprows=2)

In reality though - we could probably just delete this in the text file before we load it in!
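For anyone without the data file to hand, the same fix can be tried on an in-memory copy of the first few lines shown above. This is only a sketch - `raw_text` and `iris_fixed` are illustrative names, and `io.StringIO` stands in for the real file.

```python
import io
import pandas as pd

raw_text = (
    "Here are 2 lines\n"
    "of problem information!\n"
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "4.9,3,1.4,0.2,setosa\n"
)

# Skip the two junk lines so that the third line is read as the header
iris_fixed = pd.read_csv(io.StringIO(raw_text), skiprows=2)

print(iris_fixed.shape)
print(list(iris_fixed.columns))
```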

In [None]:
import seaborn as sns

In [None]:
iris = sns.load_dataset('iris')

In [None]:
iris.head()

## Diamonds are good for the multi level group bys

In [None]:
diamonds = sns.load_dataset('diamonds')

In [None]:
diamonds.head()

In [None]:
test = diamonds.groupby(['cut', 'color', 'clarity']).mean()

In [None]:
test.head(30)

In [None]:
idx = pd.IndexSlice

In [None]:
test.loc[idx[:, ['D', 'E'], ['I1', 'SI1']], ['depth', 'price']].reset_index()

In [None]:
df = pd.DataFrame(np.random.randn(10, 4),
   index = pd.date_range('1/1/2000', periods=10),
   columns = ['A', 'B', 'C', 'D'])

In [None]:
df

In [None]:
df.expanding(min_periods=3).mean()

In [None]:
df.rolling(3).mean()

In [None]:
diamonds

In [None]:
diamonds.sort_values('price').groupby('cut')['price'].transform(lambda x: x.mean())

## Financial data with date time index

In [None]:
tsla = pd.read_csv('data/TSLA.csv', index_col='Date')

In [None]:
tsla.info()

In [None]:
tsla = pd.read_csv('data/TSLA.csv', index_col='Date', parse_dates=True)

In [None]:
tsla.info()

In [None]:
tsla['Adj Close'].plot(figsize=(16, 8))

In [None]:
tsla[['Open', 'Close']].plot(figsize=(16, 8))

In [None]:
tsla['year'] = tsla.index.year

In [None]:
tsla['month'] = tsla.index.month

In [None]:
tsla['ACMS'] = tsla.sort_index().groupby(['year', 'month'])['Adj Close'].transform('first')

In [None]:
tsla['MN'] = np.log(tsla['Adj Close'] / tsla['ACMS'])

In [None]:
tsla['MN'].plot(figsize=(16, 8))

In [None]:
tsla = pd.read_csv('data/TSLA.csv', index_col='Date', parse_dates=True)

In [None]:
tsla_new = (
    tsla
    .assign(year = lambda df: df.index.year)
    .assign(month = lambda df: df.index.month)
    .assign(start_of_month_close = lambda df: df.groupby(['year', 'month'])['Adj Close'].transform('first'))
)

In [None]:
tsla_new[['start_of_month_close', 'Adj Close']].plot(figsize=(16, 8))

In [None]:
tsla_new = (
    tsla
    .assign(**{'year': lambda df: df.index.year, 'month': lambda df: df.index.month})
    .assign(**{'start_of_month_close': lambda df: df.groupby(['year', 'month'])['Adj Close'].transform('first')})
)

In [None]:
tsla_new

In [None]:
data = pd.Series([1, np.nan, 'hello', None])

In [None]:
data

In [None]:
data.dtype