In [2]:
import pandas as pd
import numpy as np

# Intro to Data Structures

First off, a lot of the material presented here comes from the pandas [Intro to Data Structures Tutorial](https://pandas.pydata.org/pandas-docs/stable/dsintro.html). That said, I hope to add some color on what is important to learn from pandas by sharing the idioms that I use day to day as a data scientist. So let's start.

## Series

Pandas has two main data structures: the Series and the DataFrame. A series is like a column in Excel: a list of data points that all share the same type. The basic way to create a series object is below:

In [3]:
pd.Series?

In [4]:
s = pd.Series(
        np.random.randn(5), 
        index=['a', 'b', 'c', 'd', 'e'], 
        name='example')

s

a   -1.066690
b   -0.640083
c    0.253090
d    0.615389
e   -0.053631
Name: example, dtype: float64

There are other ways to make a series (like from a dictionary), but in general this is the only one that I ever use. Notice that a series has three important parts:

1. The data
2. The index 
3. The name
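For completeness, here is a minimal sketch of the dictionary route mentioned above, and of how to read back each of the three parts. The variable name `prices` and its values are made up for illustration.

```python
import pandas as pd

# Build a Series from a dictionary: keys become the index, values the data.
prices = pd.Series({'a': 1.5, 'b': 2.0, 'c': 3.25}, name='price')

print(prices.values)  # the data, as a NumPy array
print(prices.index)   # the index labels
print(prices.name)    # the name
```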

The data can be a list of data, or a single instance that broadcasts, like below:

In [5]:
pd.Series(5, index=['a', 'b', 'c', 'd', 'e'])

a    5
b    5
c    5
d    5
e    5
dtype: int64

(Broadcasting, as we will see later on, is really important.)

The data is what you as a data scientist are interested in. The index is often used in time series, but otherwise I rarely use the index for series (I do use the index for dataframes quite a lot, though!). Still, notice that each data point is associated with an index label.

Finally, the name. The name only really matters when you add a series to a dataframe, in which case the name of the series becomes the column name.
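A small sketch of the name turning into a column label. The series `s_named` and `other` are illustrative stand-ins, not the `s` from the cells above.

```python
import pandas as pd

s_named = pd.Series([1.0, 2.0, 3.0], name='example')

# to_frame() uses the series name as the column label.
frame = s_named.to_frame()
print(frame.columns.tolist())

# pd.concat along axis=1 does the same for several series at once.
other = pd.Series([10, 20, 30], name='other')
combined = pd.concat([s_named, other], axis=1)
print(combined.columns.tolist())
```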

You have not yet seen why series are all that useful, so let's get into it. Series have various ways that you can index into them:

In [6]:
# positional access; plain s[0] on a label index is deprecated in recent pandas
s.iloc[0]

-1.0666901552914934

In [7]:
s[:3]

a   -1.066690
b   -0.640083
c    0.253090
Name: example, dtype: float64

In [8]:
# positional access by a list of integer positions
s.iloc[[4, 3, 1]]

e   -0.053631
d    0.615389
b   -0.640083
Name: example, dtype: float64

In [9]:
s.values

array([-1.06669016, -0.64008336,  0.25308996,  0.61538851, -0.05363091])

In [10]:
s['e'] = 500
s

a     -1.066690
b     -0.640083
c      0.253090
d      0.615389
e    500.000000
Name: example, dtype: float64

Generally speaking, I don't do any of the above operations. If you find yourself using them, give some thought to whether you should be using pandas for those operations or NumPy instead.
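If you do need element access, being explicit about labels versus positions avoids surprises. The sketch below uses a made-up series `t` whose integer labels are deliberately out of order, so the two lookups disagree.

```python
import pandas as pd

# Integer labels that do not match the positions.
t = pd.Series([10.0, 20.0, 30.0], index=[2, 1, 0])

by_label = t.loc[0]      # the element whose label is 0
by_position = t.iloc[0]  # the first element, whatever its label

print(by_label, by_position)
```

Here `.loc` always means "by label" and `.iloc` always means "by position", which is why I would reach for them over bare brackets when the distinction matters.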

Now let me show you some operations that I frequently use:

In [11]:
s[[True, True, False, False, True]]

a     -1.066690
b     -0.640083
e    500.000000
Name: example, dtype: float64

In [12]:
# or the extremely common
s[s > 0], s > 0

(c      0.253090
 d      0.615389
 e    500.000000
 Name: example, dtype: float64, a    False
 b    False
 c     True
 d     True
 e     True
 Name: example, dtype: bool)

In [13]:
# and you can mutate the data too
# you'll just need to be careful with this!
s[s < 0] *= -1

In [14]:
s

a      1.066690
b      0.640083
c      0.253090
d      0.615389
e    500.000000
Name: example, dtype: float64
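If the in-place mutation above makes you nervous, one alternative is to build a new series and leave the original alone. This is only a sketch on a made-up series `raw`; `where` keeps values where the condition holds and substitutes the second argument elsewhere.

```python
import pandas as pd

raw = pd.Series([-1.5, 0.5, -2.0], index=['a', 'b', 'c'])

# Keep non-negative values, replace the rest with their negation.
flipped = raw.where(raw >= 0, -raw)

# For this particular transformation abs() is the simpler spelling.
same_as_abs = flipped.equals(raw.abs())

print(flipped)
print(raw)  # unchanged
```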

But one thing that is super useful about series is that you can do vectorized operations (fast computations over the entire series at once) on them. You have already seen one:

In [15]:
s > 0

a    True
b    True
c    True
d    True
e    True
Name: example, dtype: bool

In [16]:
s + s

a       2.133380
b       1.280167
c       0.506180
d       1.230777
e    1000.000000
Name: example, dtype: float64

In [17]:
np.exp(s)

a     2.905746e+00
b     1.896639e+00
c     1.287999e+00
d     1.850375e+00
e    1.403592e+217
Name: example, dtype: float64

In [18]:
s.mean()

100.51505039803234

In [19]:
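# just be careful with some operations:
# labels missing from either side come back as NaN.
# Here every value is positive after the mutation above,
# so the indexes match and no NaNs appear.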
s + s[s > 0]

a       2.133380
b       1.280167
c       0.506180
d       1.230777
e    1000.000000
Name: example, dtype: float64
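To actually see the NaNs, the two indexes have to differ. A minimal sketch with two made-up series, `left` and `right`, plus the `fill_value` argument of `Series.add` as one way to avoid them:

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
right = pd.Series([10.0, 20.0], index=['b', 'c'])

# Values are matched by label; 'a' has no partner, so the result is NaN there.
total = left + right

# Treat a missing partner as 0 instead.
filled = left.add(right, fill_value=0)

print(total)
print(filled)
```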

These column-wise operations are what pandas is made for. Any time you stray from doing operations over columns, you should ask yourself: is pandas the best tool for this?

Operations over one column are useful, but what about operations over multiple columns?

## DataFrames

Series are nice, but the really nice thing is what you can build from them: dataframes. Dataframes are like an entire Excel spreadsheet! As you can probably guess, a dataframe is a collection of series, each one with a name and all sharing the same index. Thus an easy way to create a dataframe is from a dictionary of series/lists:

In [20]:
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
    'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

df = pd.DataFrame(d)

df

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0


In [21]:
# often I actually just pass in np arrays and scalars

d = {'one' : 'Hellow',
    'two' : np.array([1., 2., 3., 4.])}

df = pd.DataFrame(d)
df

      one  two
0  Hellow  1.0
1  Hellow  2.0
2  Hellow  3.0
3  Hellow  4.0


There are plenty of ways to create one of these things, but generally speaking just knowing one is enough. You can always change the values of the index or the columns later:

In [22]:
df.columns = ['1', '2']
df.index = ['a', 'b', 'c', 'd']
df

        1    2
a  Hellow  1.0
b  Hellow  2.0
c  Hellow  3.0
d  Hellow  4.0


Dataframes are basically just dictionaries of columns/series, so you can use most of the same techniques you used for series on dataframes themselves. 

The general way to reference a column is below:

In [23]:
d = {'one' : 'Hellow',
    'two' : np.array([1., 2., 3., 4.])}

df = pd.DataFrame(d)
df.index = ['a', 'b', 'c', 'd']

# gives you back a named series
df['one']

a    Hellow
b    Hellow
c    Hellow
d    Hellow
Name: one, dtype: object

You can then do anything with the series that we did above, nifty.

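There is a dot notation shortcut (`df.one`), but it is almost better not to know it, because it leads to errors when used carelessly: it fails for column names that collide with dataframe methods or are not valid Python identifiers, and it cannot create new columns.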
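A sketch of two ways the dot shortcut can bite, on a made-up dataframe `frame` with a column deliberately named `count`. The warning filter is only there to keep the output quiet.

```python
import warnings
import pandas as pd

frame = pd.DataFrame({'count': [1, 2, 3], 'two': [1.0, 2.0, 3.0]})

# 'count' is also a DataFrame method, so attribute access finds the method,
# not the column. Bracket access still returns the column.
attr_is_method = callable(frame.count)
column_values = frame['count'].tolist()

# Assigning to a new attribute sets a plain Python attribute;
# it does not create a column (pandas emits a warning about this).
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    frame.three = [7, 8, 9]

created_column = 'three' in frame.columns

print(attr_is_method, column_values, created_column)
```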

You can of course delete and create columns, with broadcasting as well:

In [24]:
del df['one']

In [25]:
df['three'] = df['two'] + df['two']
df['four'] = 'four'
df['five'] = df['four'][:2]

In [26]:
df

   two  three  four  five
a  1.0    2.0  four  four
b  2.0    4.0  four  four
c  3.0    6.0  four   NaN
d  4.0    8.0  four   NaN


Again, there are other ways of inserting columns (the `insert` and `assign` methods), but I never use them. The benefits of the other methods seem pretty small.
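In case you run into them in other people's code, here is a rough sketch of both methods on a made-up dataframe `frame`. The main differences from bracket assignment are that `assign` returns a new dataframe and `insert` lets you choose the column position.

```python
import pandas as pd

frame = pd.DataFrame({'two': [1.0, 2.0, 3.0]}, index=['a', 'b', 'c'])

# assign returns a new dataframe and leaves the original untouched.
with_double = frame.assign(double=frame['two'] * 2)

# insert modifies the dataframe in place, here placing the column first.
# A scalar value is broadcast to every row.
frame.insert(0, 'label', 'x')

print(with_double.columns.tolist())
print(frame.columns.tolist())
```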

Next let's go over indexing and selecting with dataframes. There are basically four ways to do so:

In [27]:
# get a column
df['two']

a    1.0
b    2.0
c    3.0
d    4.0
Name: two, dtype: float64

In [28]:
# or more
df[['five', 'two']]

   five  two
a  four  1.0
b  four  2.0
c   NaN  3.0
d   NaN  4.0


In [29]:
# select by indexes and column names
df.loc['a', 'two']

1.0

In [30]:
df.loc['d':'a':-1, 'two':'three']

   two  three
d  4.0    8.0
c  3.0    6.0
b  2.0    4.0
a  1.0    2.0


In [31]:
# select rows and columns by their ordering
df.iloc[1:3, 0]

b    2.0
c    3.0
Name: two, dtype: float64

In [32]:
df.iloc[1:3]

   two  three  four  five
b  2.0    4.0  four  four
c  3.0    6.0  four   NaN
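Two details of the selections above are easy to miss, so here is a sketch on a made-up dataframe `frame`: label slices with `.loc` include their end point while position slices with `.iloc` do not, and `.loc` also accepts the boolean masks we used on series.

```python
import pandas as pd

frame = pd.DataFrame(
    {'two': [1.0, 2.0, 3.0, 4.0], 'three': [2.0, 4.0, 6.0, 8.0]},
    index=['a', 'b', 'c', 'd'])

# Label slice: 'b' is included.
by_label = frame.loc['a':'b']

# Position slice: position 1 is excluded.
by_position = frame.iloc[0:1]

# Boolean mask for the rows, list of names for the columns.
big = frame.loc[frame['two'] > 2, ['three']]

print(len(by_label), len(by_position))
print(big)
```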


## DataFrame Functions

In addition to doing these columnwise operations, you can also do some dataframewise operations. 

The most useful of these is the copy method. It makes a copy :)

In [33]:
df.copy()

   two  three  four  five
a  1.0    2.0  four  four
b  2.0    4.0  four  four
c  3.0    6.0  four   NaN
d  4.0    8.0  four   NaN
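Why a copy matters: plain assignment only gives the same dataframe a second name. A minimal sketch with made-up names `original`, `alias`, and `snapshot`:

```python
import pandas as pd

original = pd.DataFrame({'two': [1.0, 2.0]})

alias = original            # same object under a second name
snapshot = original.copy()  # independent copy of the data

# Replace the column on the original.
original['two'] = original['two'] * 10

print(alias['two'].tolist())     # follows the original
print(snapshot['two'].tolist())  # keeps the old values
```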


The astype method converts the types of columns:

In [42]:
# np.int was removed in NumPy 1.24; the builtin int works everywhere
df['two'].astype(int)

a    1
b    2
c    3
d    4
Name: two, dtype: int64
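`astype` also works on a whole dataframe. Passing a dictionary converts several columns at once; the sketch below uses a made-up dataframe `frame`. Like the series version, it returns a new object rather than changing the original.

```python
import pandas as pd

frame = pd.DataFrame({'two': [1.0, 2.0], 'three': [2.0, 4.0]})

# Map column name -> target type; unlisted columns keep their dtype.
converted = frame.astype({'two': int, 'three': str})

print(converted.dtypes)
print(frame.dtypes)  # the original is unchanged
```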

The next thing that I very commonly use is the dataframe transpose ability:

In [34]:
df.T

          a     b     c     d
two       1     2     3     4
three     2     4     6     8
four   four  four  four  four
five   four  four   NaN   NaN


This puts the rows as the columns and the columns as the rows. It can be a good way to do row-wise operations, but mainly I do it to display dataframe values. Below are the three common ways to display dataframe values:

In [35]:
df.head(2)

   two  three  four  five
a  1.0    2.0  four  four
b  2.0    4.0  four  four


In [36]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 4 columns):
two      4 non-null float64
three    4 non-null float64
four     4 non-null object
five     2 non-null object
dtypes: float64(2), object(2)
memory usage: 320.0+ bytes


In [38]:
df.describe(include='all')

             two     three  four  five
count        4.0       4.0     4     2
unique       NaN       NaN     1     1
top          NaN       NaN  four  four
freq         NaN       NaN     4     2
mean         2.5       5.0   NaN   NaN
std     1.290994  2.581989   NaN   NaN
min          1.0       2.0   NaN   NaN
25%         1.75       3.5   NaN   NaN
50%          2.5       5.0   NaN   NaN
75%         3.25       6.5   NaN   NaN


You will notice, however, that when there are too many columns the display gets messy:

In [39]:
for i in range(20):
    df[i] = i
    
df.head()

   two  three  four  five  0  1  2  3  4  5  ...  10  11  12  13  14  15  16  17  18  19
a  1.0    2.0  four  four  0  1  2  3  4  5  ...  10  11  12  13  14  15  16  17  18  19
b  2.0    4.0  four  four  0  1  2  3  4  5  ...  10  11  12  13  14  15  16  17  18  19
c  3.0    6.0  four   NaN  0  1  2  3  4  5  ...  10  11  12  13  14  15  16  17  18  19
d  4.0    8.0  four   NaN  0  1  2  3  4  5  ...  10  11  12  13  14  15  16  17  18  19


In [40]:
# transposing helps
df.head().T

          a     b     c     d
two       1     2     3     4
three     2     4     6     8
four   four  four  four  four
five   four  four   NaN   NaN
0         0     0     0     0
1         1     1     1     1
2         2     2     2     2
3         3     3     3     3
4         4     4     4     4
5         5     5     5     5


Sometimes this will also truncate. To view more you can always change the display options as below (by the way, pandas has many, many options; you can check them all out either [here](https://pandas.pydata.org/pandas-docs/stable/options.html) or with `pd.set_option?`):

In [41]:
pd.set_option('display.max_rows', 100)
# use the full option name; the bare 'precision' is ambiguous in newer pandas
pd.set_option('display.precision', 7)
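If you only want different display settings for one cell, `pd.option_context` changes an option temporarily and restores it afterwards. A small sketch, with a made-up 30-column dataframe `wide`:

```python
import pandas as pd

wide = pd.DataFrame({f'col{i}': [i] for i in range(30)})

before = pd.get_option('display.max_columns')

# Inside the block no column limit applies; None means "no limit".
with pd.option_context('display.max_columns', None):
    inside = pd.get_option('display.max_columns')
    print(wide)

# The previous setting is back once the block exits.
after = pd.get_option('display.max_columns')
```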

## Conclusion

Congrats, that's the basics! Now you should be able to do the [Getting and knowing Exercises](https://github.com/guipsamora/pandas_exercises#getting-and-knowing).

There is much more to do and know about pandas, and we'll be going through more here!