# 2.2 Pandas - Series

In [1]:
# to make the .py script runnable
#!/usr/bin/env python

In [2]:
from sklearn import datasets
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('ggplot')

In [3]:
import os

## 2.2.1 Intro to Pandas

---

- A fast and efficient DataFrame object for data manipulation with integrated indexing;
- Tools for reading and writing data between in-memory data structures and different formats: CSV and text files, Microsoft Excel, SQL databases, and the fast HDF5 format;
- Intelligent data alignment and integrated handling of missing data: gain automatic label-based alignment in computations and easily manipulate messy data into an orderly form;
- Flexible reshaping and pivoting of data sets;
- Intelligent label-based slicing, fancy indexing, and subsetting of large data sets;
- Columns can be inserted and deleted from data structures for size mutability;
- Aggregating or transforming data with a powerful group by engine allowing split-apply-combine operations on data sets;
- High performance merging and joining of data sets;
- Hierarchical axis indexing provides an intuitive way of working with high-dimensional data in a lower-dimensional data structure;
- Time series functionality: date range generation and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging. Even create domain-specific time offsets and join time series without losing data;
- Highly optimized for performance, with critical code paths written in Cython or C;
- Python with pandas is in use in a wide variety of academic and commercial domains, including Finance, Neuroscience, Economics, Statistics, Advertising, Web Analytics, and more.

---

In [None]:
import numpy as np
import pandas as pd

In [None]:
print(pd.__version__)

## 2.2.2 Intro to Series

A Series combines characteristics of one-dimensional arrays (from NumPy) and dictionaries. It has the functionality of an array (allowing similar mathematical operations), while its index labels can be of any hashable type (as with dictionary keys). The index is ordered, so a Series also supports positional access and slicing, which dictionaries do not. A Series can also be given a name, which will become useful later.

> Syntax: `Series(data=, index=, dtype=, name=)`
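
As a small sketch of this dual nature (the name `prices` and its values are made up for illustration), a Series supports both elementwise arithmetic and lookup by label:

```python
import pandas as pd

# Hypothetical data, used only for illustration
prices = pd.Series(data=[1.5, 2.0, 3.25],
                   index=['apple', 'pear', 'kiwi'],
                   dtype=float,
                   name='price')

doubled = prices * 2          # array-like: elementwise arithmetic
pear_price = prices['pear']   # dict-like: lookup by label

print(doubled)
print(pear_price, prices.name)
```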

## 2.2.3 Defining Series

In [None]:
test_series1 = pd.Series([1, 2, 3])
test_series2 = pd.Series([4, 5, 6])
test_series3 = pd.Series([4, 5, 6], index=[3,4,5])
test_series4 = pd.Series([4, 5, 6], index=list('abc'))

In [None]:
print(test_series1)
print(test_series2)
print(test_series3)
print(test_series4)

In [None]:
# Append a single item to a series (mind the brackets!)
# Series.append was removed in pandas 2.0; pd.concat takes a list of Series instead
pd.concat([test_series1, pd.Series([4])])

**Notice that the index is not 'correct' (the label 0 now appears twice) and that this does not happen in place: a new Series is returned.**

In [None]:
pd.concat([test_series1, test_series2])

In [None]:
pd.concat([test_series1, test_series3])

In [None]:
pd.concat([test_series1, test_series4])

In [None]:
test_series1 = pd.concat([test_series1, test_series2])
test_series1

In [None]:
test_series1.drop(2)
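
Combining two Series that both use the default index produces duplicate labels, and `drop` then removes every row carrying that label. A minimal sketch, using the throwaway names `a` and `b`, of how `ignore_index=True` in `pd.concat` renumbers the result instead:

```python
import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([4, 5, 6])

stacked = pd.concat([a, b])                        # index: 0, 1, 2, 0, 1, 2
renumbered = pd.concat([a, b], ignore_index=True)  # index: 0 .. 5

# drop() removes every row with the given label, so duplicates go together
print(stacked.drop(2).tolist())      # [1, 2, 4, 5]
print(renumbered.drop(2).tolist())   # [1, 2, 4, 5, 6]
```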

Construct a series, starting from an array:

In [None]:
x_random = np.random.randn(10).round(2) 
x_random

In [None]:
type(x_random)

In [None]:
s_random = pd.Series(x_random) # no index specified, numeric will be automatically generated
s_random

In [None]:
type(s_random)

In [None]:
my_series = pd.Series(x_random, index=list('aabbbccdef')) # passing an index specifically
my_series

In [None]:
pd.Series(x_random,
          name='my_series_1',  # the parameter 'name' will become useful later
          dtype=object,
          index=['ind_' + str(i) for i in range(10)])

Construct a series, starting from a dictionary, list or tuple:

In [None]:
dict_1 = {'a': 1, 'b': 2, 'c':3}

In [None]:
dict_1

In [None]:
pd.Series(dict_1)

In [None]:
pd.Series(data=[1, 2, 3], 
       index=list('abc'), 
       name='Series_1', 
       dtype=float)

In [None]:
pd.Series(data=(1, 2, 3), index=list('abc'), 
       name='Series_1', dtype=np.int64)
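
When a dictionary is combined with an explicit `index`, the keys are matched against the requested labels rather than paired by position. The sketch below reuses `dict_1`; the label `'d'` is an assumption added to show what happens to a label without a matching key:

```python
import pandas as pd

dict_1 = {'a': 1, 'b': 2, 'c': 3}

# 'a' is not requested and is left out;
# 'd' has no entry in the dictionary and becomes NaN
selected = pd.Series(dict_1, index=['b', 'c', 'd'])
print(selected)
```

Because NaN is a floating-point value, the resulting Series has dtype `float64` even though the dictionary holds integers.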

### 2.2.3.1 Modifying the index and name

In [None]:
my_new_series = pd.Series(np.random.randn(5).round(2), index = list('abcde'))
my_new_series

In [None]:
my_new_series.index = list('ab' * 2+'c')

In [None]:
my_new_series

In [None]:
my_new_series.name = 'ser1'

In [None]:
my_new_series
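
An index is allowed to contain duplicate labels, as the reassigned index `a, b, a, b, c` shows. A short sketch (with fixed values in place of the random ones, an assumption made for reproducibility) of how a lookup behaves in that case:

```python
import pandas as pd

dup = pd.Series([0.1, 0.2, 0.3, 0.4, 0.5], index=list('ab' * 2 + 'c'))

print(dup.index.is_unique)  # False: 'a' and 'b' each appear twice
print(dup['a'])             # a Series holding both rows labelled 'a'
```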

---------------------------------------------------------------------------------------------------------------------

## 2.2.4 Subsetting a Series


The different methods of subsetting that we've seen so far include

- Using slices or positional indexers (for lists and arrays)
- Using keys (for dictionaries)
- Using bools (for arrays)

For a pandas Series, we can use any of the above strategies, as well as specialized indexers for pulling data from a Series.


In [None]:
my_series = pd.Series(np.random.randn(5).round(2), index = list('abcde'))
my_series

In [None]:
# One Label
my_series['a']

In [None]:
# List of Labels
my_series[['a', 'b']] 

In [None]:
# Label slice (both endpoints are included)
my_series['b':'d']

In [None]:
# Positional slice (the end position is excluded)
my_series[1:4]

In [None]:
# positional slicing
my_series[:3]

In [None]:
my_series[:2]

In [None]:
my_series[::-1]

In [None]:
my_series[::-2]

This also works with a Boolean array

In [None]:
pos_num = my_series > 0
pos_num

In [None]:
my_series[pos_num]

In [None]:
positives = my_series[my_series > 0]
positives  

We can also access a value directly by using its label as an attribute.

In [None]:
my_series.a
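
Attribute access only works when the label is a valid Python identifier that does not collide with an existing Series attribute or method. A minimal sketch, with the label `'max'` chosen deliberately to collide with `Series.max`:

```python
import pandas as pd

s = pd.Series([1, 2], index=['a', 'max'])

print(s.a)              # 1: 'a' does not clash with any Series attribute
print(s['max'])         # 2: bracket access always refers to the label
print(callable(s.max))  # True: s.max is the method, not the value labelled 'max'
```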

The methods described above work well for simple tasks, such as printing certain values of a Series. For more complex tasks, we recommend the `.loc` and `.iloc` indexers. These are more explicit and more robust: they eliminate any possible confusion between looking for a certain position (show me the second value) and a certain label (show me the value with index label 2). We will talk more about this in notebook nb04.
- .loc[ ] : subsetting based on the index labels
- .iloc[ ] : subsetting based on the position
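
A minimal sketch with integer labels stored in reverse order (hypothetical data), so that label and position disagree: `.loc` looks up labels, `.iloc` looks up positions.

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=[2, 1, 0])

print(s.loc[2])    # 10: the row labelled 2
print(s.iloc[2])   # 30: the row at position 2 (the third row)

letters = pd.Series([1, 2, 3, 4], index=list('abcd'))
print(letters.loc['b':'c'].tolist())  # [2, 3]: label slices include the end label
print(letters.iloc[1:2].tolist())     # [2]: positional slices exclude the end
```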

In [None]:
#?my_series.loc #Delete the '#' and run for extra info

In [None]:
my_series.loc[['a', 'c', 'e']]

In [None]:
my_series.loc[my_series > 0] #works with boolean

In [None]:
#?my_series.iloc #Delete the '#' and run for extra info

In [None]:
my_series.iloc[2:4]

In [None]:
#my_series.iloc[my_series > 0] # Delete the first '#' to see that .iloc does not accept a boolean Series
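
`.iloc` is purely positional, so it rejects a boolean Series, which carries its own index. It does accept a plain boolean array. A sketch with fixed values (an assumption, in place of the random data above):

```python
import pandas as pd

s = pd.Series([-1.0, 0.5, 2.0], index=list('abc'))

mask = (s > 0).to_numpy()   # plain boolean array, no index attached
print(s.iloc[mask])         # the rows labelled 'b' and 'c'
```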

## 2.2.5 Attributes, methods and functions

### 2.2.5.1 Series Attributes

In [None]:
my_series

In [None]:
my_series.values

In [None]:
my_series.index

In [None]:
type(my_series.index)
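
A few further attributes are often useful when inspecting a Series. The sketch below uses a small made-up Series named `demo`:

```python
import pandas as pd

s = pd.Series([1.5, 2.5, 3.5], index=list('abc'), name='demo')

print(s.dtype)       # data type of the values
print(s.shape)       # (3,): a Series is one-dimensional
print(s.size)        # 3: number of elements
print(s.name)        # 'demo'
print(s.to_numpy())  # the values as a NumPy array
```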

### 2.2.5.2 Array Operations on a Series

Array operations on a Series preserve the index-value links.

In [None]:
abs_my_series=np.absolute(my_series)
print(abs_my_series)

In [None]:
my_series / 2

In [None]:
my_series_2 = pd.Series({'c': 1, 'd': 0.14, 'e':10, 'f': 2, 'g':-0.5})
print(my_series,'\n', my_series_2)

In [None]:
print(my_series + my_series_2) #the elementwise addition finds the elements with the same keys


In [None]:
print(my_series * my_series_2) #We will come back to these NaN later.
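
The NaN values appear for every label that exists in only one of the two Series. As a sketch with made-up data, the method form `Series.add` accepts a `fill_value` that stands in for the missing side:

```python
import pandas as pd

left = pd.Series({'a': 1.0, 'b': 2.0, 'c': 3.0})
right = pd.Series({'b': 10.0, 'c': 20.0, 'd': 30.0})

plain = left + right                    # 'a' and 'd' become NaN
filled = left.add(right, fill_value=0)  # missing side treated as 0

print(plain)
print(filled)
```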

In [None]:
#my_series > my_series_2 # Delete the first '#' to see that this raises an error instead of producing NaN: comparisons require identically labelled Series

In [None]:
my_series_3 = pd.Series({'a': 2, 'b': -0.14, 'c':1, 'd': 2.25, 'e':-0.5})

In [None]:
my_series > my_series_3
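
Comparison operators require both Series to carry the same labels. One possible way to compare Series with different indexes, sketched here with made-up data, is to first restrict both to their common labels with `Series.align`:

```python
import pandas as pd

x = pd.Series({'a': 1, 'b': 5, 'c': 3})
y = pd.Series({'b': 2, 'c': 4, 'd': 0})

# Keep only the labels present in both Series, then compare
x_common, y_common = x.align(y, join='inner')
result = x_common > y_common
print(result)   # b: True, c: False
```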

### 2.2.5.3 The `.isin()` method

In [None]:
pls = pd.Series(['c', 'py', 'java', 'scala'])

In [None]:
pls.isin(['c', 'py'])

In [None]:
# Negate a Boolean Series with ~ (unary minus is not supported for booleans)
~pls.isin(['c', 'py'])

In [None]:
pls[pls.isin(['java', 'py'])]

---------------------------------------------------------------------------------------------------------------------

## Try!

Once again, generate random grades (maximum 20) for ten students and save them as a Series. Think about a better way of indexing than the default 0..9, such as a student ID or name, and create a Series with this index. Find the student ID/name of the student(s) with the highest grade. Transform the grades into percentages.

## Solution

In [1]:
import numpy as np
import pandas as pd 

In [2]:
grades = np.random.randint(0, 21, 10)
grades

array([10,  9, 15,  2, 15,  0,  4, 15, 13, 13])

In [3]:
grades_ser=pd.Series(grades,index=['Tony','Steve','Thor','Bruce','Natasha','Clint','Pietro','Wanda','James','Vision'])
grades_ser

Tony       10
Steve       9
Thor       15
Bruce       2
Natasha    15
Clint       0
Pietro      4
Wanda      15
James      13
Vision     13
dtype: int64

In [4]:
grades_ser.max()

15

In [5]:
grades_ser[grades_ser==grades_ser.max()].index

Index(['Thor', 'Natasha', 'Wanda'], dtype='object')
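
The boolean mask is used here because several students can share the top grade. As a sketch on a shortened copy of the data above, `idxmax` would return only the first label with the maximum value:

```python
import pandas as pd

grades_ser = pd.Series([10, 9, 15, 2, 15],
                       index=['Tony', 'Steve', 'Thor', 'Bruce', 'Natasha'])

first_best = grades_ser.idxmax()  # only the first occurrence of the maximum
all_best = grades_ser[grades_ser == grades_ser.max()].index.tolist()

print(first_best)  # Thor
print(all_best)    # ['Thor', 'Natasha']
```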

In [8]:
# Grades are out of 20, so multiplying by 5 gives a percentage
grades_ser_perc = grades_ser * 5
grades_ser_perc

Tony       50
Steve      45
Thor       75
Bruce      10
Natasha    75
Clint       0
Pietro     20
Wanda      75
James      65
Vision     65
dtype: int64

In [9]:
grades_ser_perc=grades_ser/20*100
grades_ser_perc

Tony       50.0
Steve      45.0
Thor       75.0
Bruce      10.0
Natasha    75.0
Clint       0.0
Pietro     20.0
Wanda      75.0
James      65.0
Vision     65.0
dtype: float64