# 2.2 Pandas - Series

In [None]:
#!/usr/bin/env python
# The shebang above makes the exported .py script runnable; it must be the first line of the file

In [1]:
from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('ggplot')

## 2.2.1 Intro to Pandas

---

- A fast and efficient DataFrame object for data manipulation with integrated indexing;
- Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
- Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
- Flexible reshaping and pivoting of data sets;
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
- Columns can be inserted and deleted from data structures for size mutability;
- Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
- High performance merging and joining of data sets;
- Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
- Time series-functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
- Highly optimized for performance, with critical code paths written in Cython or C.
- Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

---

In [1]:
import numpy as np
import pandas as pd

In [2]:
print(pd.__version__)

0.23.4


In [3]:
print(np.__version__)

1.15.4


## 2.2.2 Intro to Series

A Series combines characteristics of one-dimensional arrays (from NumPy) and dictionaries. Series have the functionality of arrays (allowing for similar mathematical operations), while the indices can be of any type (similar to dictionary keys). These indices, however, remain ordered, and unlike dictionary keys they may contain duplicates. A Series can also be given a name (a title), which will become useful later.

> Syntax: `Series(data=, index=, dtype=, name=)`

## 2.2.3 Defining Series

In [4]:
test_series1 = pd.Series([1, 2, 3])
test_series2 = pd.Series([4, 5, 6])
test_series3 = pd.Series([4, 5, 6], index=[3,4,5])
test_series4 = pd.Series([4, 5, 6], index=list('abc'))

In [5]:
print(test_series1)
print(test_series2)
print(test_series3)
print(test_series4)

0    1
1    2
2    3
dtype: int64
0    4
1    5
2    6
dtype: int64
3    4
4    5
5    6
dtype: int64
a    4
b    5
c    6
dtype: int64


In [9]:
# Append a single item to a series (mind the brackets!)
test_series1.append(pd.Series([4]))

0    1
1    2
2    3
0    4
dtype: int64

**Notice that the index now contains the label 0 twice, because the appended Series keeps its own index, and that `append` does not work in place: a new Series is returned.**
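The repeated label can be avoided by asking pandas to renumber the result. The following is a minimal sketch that uses `pd.concat` as a stand-in for `append` (in pandas 2.0 and later `Series.append` has been removed, so `pd.concat` is the replacement there); the variable names are illustrative only.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
extra = pd.Series([4])

# pd.concat returns a new Series; s1 itself is left unchanged.
combined = pd.concat([s1, extra])                        # keeps both original indexes
renumbered = pd.concat([s1, extra], ignore_index=True)   # builds a fresh 0..n-1 index

print(combined.index.tolist())    # [0, 1, 2, 0]
print(renumbered.index.tolist())  # [0, 1, 2, 3]
print(len(s1))                    # 3
```

As with `append`, the result has to be assigned to a name if it should be kept.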

In [10]:
test_series1.append(test_series2)

0    1
1    2
2    3
0    4
1    5
2    6
dtype: int64

In [11]:
test_series1.append(test_series3)

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64

In [12]:
test_series1 = test_series1.append(test_series3)
print(test_series1)

0    1
1    2
2    3
3    4
4    5
5    6
dtype: int64


In [13]:
test_series1.append(test_series4)

0    1
1    2
2    3
3    4
4    5
5    6
a    4
b    5
c    6
dtype: int64

In [14]:
test_series1=test_series1.append(test_series2)
test_series1

0    1
1    2
2    3
3    4
4    5
5    6
0    4
1    5
2    6
dtype: int64

In [15]:
test_series1.drop(2)

0    1
1    2
3    4
4    5
5    6
0    4
1    5
dtype: int64
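Note that `drop(2)` removed both rows labelled `2`, and that it returned a new Series rather than changing `test_series1`. A small sketch of this behaviour on illustrative data, together with one possible way to keep only the first row for each label (using `Index.duplicated`):

```python
import pandas as pd

s = pd.Series([1, 2, 3, 6], index=[0, 1, 2, 2])

dropped = s.drop(2)                      # removes every row whose label is 2
first_only = s[~s.index.duplicated()]    # keeps the first row for each label

print(dropped.tolist())     # [1, 2]
print(first_only.tolist())  # [1, 2, 3]
print(len(s))               # 4, the original Series is unchanged
```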

Construct a series, starting from an array:

In [16]:
x_random = np.random.randn(10).round(2) 
x_random

array([ 0.66, -1.25, -0.49, -2.81,  0.23,  0.34,  0.24, -0.84, -0.15,
       -1.33])

In [17]:
type(x_random)

numpy.ndarray

In [18]:
s_random = pd.Series(x_random) # no index specified, numeric will be automatically generated
s_random

0    0.66
1   -1.25
2   -0.49
3   -2.81
4    0.23
5    0.34
6    0.24
7   -0.84
8   -0.15
9   -1.33
dtype: float64

In [19]:
type(s_random)

pandas.core.series.Series

In [20]:
my_series = pd.Series(x_random, index=list('aabbbccdef')) # passing an index specifically
my_series

a    0.66
a   -1.25
b   -0.49
b   -2.81
b    0.23
c    0.34
c    0.24
d   -0.84
e   -0.15
f   -1.33
dtype: float64

In [21]:
pd.Series(x_random, 
       name='my_series_1',                          #the parameter 'name' will become useful later
       dtype=object, 
       index=['ind_' + str(i) for i in range(10)])

ind_0    0.66
ind_1   -1.25
ind_2   -0.49
ind_3   -2.81
ind_4    0.23
ind_5    0.34
ind_6    0.24
ind_7   -0.84
ind_8   -0.15
ind_9   -1.33
Name: my_series_1, dtype: object

In [22]:
pd.Series(x_random, 
       name='my_series_1',                          #the parameter 'name' will become useful later
       dtype=float, 
       index=['ind_' + str(i) for i in range(10)])

ind_0    0.66
ind_1   -1.25
ind_2   -0.49
ind_3   -2.81
ind_4    0.23
ind_5    0.34
ind_6    0.24
ind_7   -0.84
ind_8   -0.15
ind_9   -1.33
Name: my_series_1, dtype: float64

Construct a series, starting from a dictionary, list or tuple:

In [23]:
dict_1 = {'a': 1, 'b': 2, 'c':3}

In [24]:
dict_1

{'a': 1, 'b': 2, 'c': 3}

In [25]:
pd.Series(dict_1)

a    1
b    2
c    3
dtype: int64

In [26]:
pd.Series(data=[1, 2, 3], 
       index=list('abc'), 
       name='Series_1', 
       dtype=float)

a    1.0
b    2.0
c    3.0
Name: Series_1, dtype: float64

In [27]:
pd.Series(data=(1, 2, 3), index=list('abc'), 
       name='Series_1', dtype=np.int64)

a    1
b    2
c    3
Name: Series_1, dtype: int64

### 2.2.3.1 Modifying the Index and title

In [28]:
my_new_series = pd.Series(np.random.randn(5).round(2), index = list('abcde'))
my_new_series

a   -1.29
b    1.24
c   -0.67
d    0.46
e   -0.41
dtype: float64

In [29]:
my_new_series.index = list('ab' * 2+'c')

In [30]:
my_new_series

a   -1.29
b    1.24
a   -0.67
b    0.46
c   -0.41
dtype: float64

In [31]:
my_new_series.name = 'ser1'

In [32]:
my_new_series

a   -1.29
b    1.24
a   -0.67
b    0.46
c   -0.41
Name: ser1, dtype: float64

---------------------------------------------------------------------------------------------------------------------

## 2.2.4 Subsetting a Series

<big>

The different methods of subsetting that we've seen so far include

- Using slices or positional indexers (for lists and arrays)
- Using keys (for dictionaries)
- Using bools (for arrays)

For a pandas Series, we can use any of the above strategies, as well as specialized indexers for pulling data from a Series.


In [33]:
my_series = pd.Series(np.random.randn(5).round(2), index = list('abcde'))
my_series

a   -0.04
b   -0.99
c   -0.29
d    2.29
e   -0.52
dtype: float64

In [34]:
# One Label
my_series['a']

-0.04

In [35]:
# List of Labels
my_series[['a', 'b']] 

a   -0.04
b   -0.99
dtype: float64

In [36]:
# EMPTY (!) Label Slice
my_series['b':'a']

Series([], dtype: float64)

In [37]:
# Label Slice
my_series['b':'d']

b   -0.99
c   -0.29
d    2.29
dtype: float64

In [38]:
# Positional slice (the end position 4 is excluded)
my_series[1:4]

b   -0.99
c   -0.29
d    2.29
dtype: float64
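The label slice `'b':'d'` and the positional slice `1:4` return the same rows for different reasons: a label slice includes its end label, while a positional slice excludes its end position. A minimal sketch on illustrative data, written with the explicit `.loc` (labels) and `.iloc` (positions) indexers so that the intent is unambiguous:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=list('abcde'))

by_label = s.loc['b':'d']    # label slice: the end label 'd' is included
by_position = s.iloc[1:4]    # positional slice: the end position 4 is excluded

print(by_label.index.tolist())      # ['b', 'c', 'd']
print(by_position.index.tolist())   # ['b', 'c', 'd']
print(s.iloc[1:3].index.tolist())   # ['b', 'c']
```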

In [39]:
# positional slicing
my_series[:3]

a   -0.04
b   -0.99
c   -0.29
dtype: float64

In [40]:
my_series[:2]

a   -0.04
b   -0.99
dtype: float64

In [41]:
my_series[::-1]

e   -0.52
d    2.29
c   -0.29
b   -0.99
a   -0.04
dtype: float64

In [42]:
my_series[::-2]

e   -0.52
c   -0.29
a   -0.04
dtype: float64

This also works with a Boolean array

In [43]:
pos_num = my_series > 0
pos_num

a    False
b    False
c    False
d     True
e    False
dtype: bool

In [44]:
my_series[pos_num]

d    2.29
dtype: float64

In [45]:
positives = my_series[my_series > 0]
positives  

d    2.29
dtype: float64

We can also access a value directly by using its label as an attribute.

In [46]:
my_series.a

-0.04
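Attribute access is convenient, but it only works when the label is a valid Python identifier that does not clash with an existing attribute or method of the Series. A short sketch with deliberately awkward, made-up labels:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'index', 'my key'])

print(s.a)                      # 1.0, 'a' is a valid identifier and not taken
print(s['index'])               # 2.0, brackets always refer to the label
print(type(s.index).__name__)   # Index, the built-in attribute wins over the label
print(s['my key'])              # 3.0, labels with spaces need brackets
```

For that reason bracket access is the safer default in scripts.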

The methods described so far work well for simple tasks, such as printing certain values of a Series. For more complex tasks, we recommend the `.loc` and `.iloc` indexers. These are more explicit and more robust: they eliminate any possible confusion between looking for a certain position (show me the second value) and a certain label (show me the value with index '2'). We will talk more about this in notebook nb04.
- .loc[ ] : subsetting based on the label (index)
- .iloc[ ] : subsetting based on the integer position
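The distinction matters most when the labels are themselves integers. In the sketch below (illustrative data), `.loc` selects by label and `.iloc` by integer position, so the same number `2` returns different rows:

```python
import pandas as pd

s = pd.Series(['x', 'y', 'z'], index=[2, 0, 1])

print(s.loc[2])    # x, the row whose label is 2
print(s.iloc[2])   # z, the row at position 2 (the third row)
```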

In [None]:
#?my_series.loc #Delete the '#' and run for extra info

In [47]:
my_series.loc[['a', 'c', 'e']]

a   -0.04
c   -0.29
e   -0.52
dtype: float64

In [48]:
my_series.loc[my_series > 0] #works with boolean

d    2.29
dtype: float64

In [None]:
#?my_series.iloc #Delete the '#' and run for extra info

In [49]:
my_series.iloc[2:4]

c   -0.29
d    2.29
dtype: float64

In [None]:
#my_series.iloc[my_series > 0] # Delete the first '#' to see that .iloc does not accept a boolean Series
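`.iloc` rejects a boolean Series because that mask carries labels, while `.iloc` is purely positional. One possible workaround, sketched here on values copied from the example above, is to pass the underlying boolean array via `.values`:

```python
import pandas as pd

s = pd.Series([-0.04, -0.99, -0.29, 2.29, -0.52], index=list('abcde'))
mask = s > 0                           # boolean Series, carries the index of s

via_loc = s.loc[mask]                  # .loc accepts the boolean Series directly
via_iloc = s.iloc[mask.values]         # .iloc needs a plain boolean array

print(via_loc.tolist())    # [2.29]
print(via_iloc.tolist())   # [2.29]
```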

## 2.2.5 Attributes, methods and functions

### 2.2.5.1 Series Attributes

In [50]:
my_series

a   -0.04
b   -0.99
c   -0.29
d    2.29
e   -0.52
dtype: float64

In [51]:
my_series.values

array([-0.04, -0.99, -0.29,  2.29, -0.52])

In [52]:
my_series.index

Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

In [53]:
type(my_series.index)

pandas.core.indexes.base.Index

### 2.2.5.2 Array Operations on a Series

Array operations on a Series preserve the index-value links.

In [54]:
abs_my_series=np.absolute(my_series)
print(abs_my_series)

a    0.04
b    0.99
c    0.29
d    2.29
e    0.52
dtype: float64


In [55]:
my_series / 2

a   -0.020
b   -0.495
c   -0.145
d    1.145
e   -0.260
dtype: float64

In [56]:
abs(my_series) / 2

a    0.020
b    0.495
c    0.145
d    1.145
e    0.260
dtype: float64

In [58]:
my_series_2 = pd.Series({'c': 1, 'd': 0.14, 'e':10, 'f': 2, 'g':-0.5})
print(my_series,'\n', my_series_2)

a   -0.04
b   -0.99
c   -0.29
d    2.29
e   -0.52
dtype: float64 
 c     1.00
d     0.14
e    10.00
f     2.00
g    -0.50
dtype: float64


In [59]:
print(my_series + my_series_2) #the elementwise addition finds the elements with the SAME (!) keys

a     NaN
b     NaN
c    0.71
d    2.43
e    9.48
f     NaN
g     NaN
dtype: float64


In [60]:
print(my_series * my_series_2) #We will come back to these NaN later.

a       NaN
b       NaN
c   -0.2900
d    0.3206
e   -5.2000
f       NaN
g       NaN
dtype: float64
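The `NaN` values appear for every label that exists in only one of the two Series. If a missing entry should instead count as a neutral value, the method form of the operator accepts a `fill_value`. A minimal sketch with made-up data, assuming that treating missing entries as 0 is acceptable for the task at hand:

```python
import pandas as pd

s1 = pd.Series({'a': 1.0, 'b': 2.0, 'c': 3.0})
s2 = pd.Series({'b': 10.0, 'c': 20.0, 'd': 30.0})

plain = s1 + s2                      # 'a' and 'd' exist on one side only, so NaN
filled = s1.add(s2, fill_value=0)    # the missing side is treated as 0

print(plain.isna().tolist())   # [True, False, False, True]
print(filled.tolist())         # [1.0, 12.0, 23.0, 30.0]
```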


In [61]:
my_series > my_series_2 # unlike + and *, a comparison does not align the indices: differently labelled Series raise an error

ValueError: Can only compare identically-labeled Series objects
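One way around this error is to align the two Series explicitly before comparing them. The sketch below uses made-up data and `Series.align` with an inner join, which keeps only the labels present in both Series; whether dropping the other labels is appropriate depends on the analysis.

```python
import pandas as pd

s1 = pd.Series({'a': 1.0, 'b': 2.0, 'c': 3.0})
s2 = pd.Series({'b': 10.0, 'c': 0.5, 'd': 30.0})

# Keep only the labels that occur in both Series, then compare.
left, right = s1.align(s2, join='inner')
result = left > right

print(result.index.tolist())   # ['b', 'c']
print(result.tolist())         # [False, True]
```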

In [62]:
my_series_3 = pd.Series({'a': 2, 'b': -0.14, 'c':1, 'd': 2.25, 'e':-0.5})

In [63]:
my_series > my_series_3

a    False
b    False
c    False
d     True
e    False
dtype: bool

### 2.2.5.3 The `.isin()` method

In [64]:
pls = pd.Series(['c', 'py', 'java', 'scala'])

In [65]:
pls

0        c
1       py
2     java
3    scala
dtype: object

In [66]:
pls.isin(['c', 'py'])

0     True
1     True
2    False
3    False
dtype: bool

**Invert the result (negation with the `~` operator, the idiomatic choice for a boolean Series)**

In [67]:
~pls.isin(['c', 'py'])

0    False
1    False
2     True
3     True
dtype: bool

In [68]:
pls[pls.isin(['java', 'py'])]

1      py
2    java
dtype: object

---------------------------------------------------------------------------------------------------------------------

## Try!

Again create random grades (maximum 20) for ten students and save them as a Series. Think about a better way of indexing than the default 0..9, such as a student ID or a name, and create a Series with this index. Look up the student ID or name of the student(s) with the highest grade. Transform the grades into percentages.

## Solution
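One possible approach, offered as a sketch rather than the reference solution. The student IDs (`s01` to `s10`) and the fixed random seed are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)    # fixed seed so that the run is repeatable
student_ids = ['s' + str(i).zfill(2) for i in range(1, 11)]   # hypothetical IDs

# Integer grades between 0 and 20 (the upper bound of randint is exclusive).
grades = pd.Series(rng.randint(0, 21, size=10), index=student_ids, name='grade')

# A boolean mask keeps every student who shares the top grade.
best = grades[grades == grades.max()].index.tolist()

percentages = grades * 100 / 20
percentages.name = 'percentage'

print(grades)
print('Highest grade:', best)
print(percentages)
```

Using a mask instead of `idxmax()` is a deliberate choice here: `idxmax()` returns only the first label when several students tie for the highest grade.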