# Pandas Data Structures - Series<br>
1. Series <br>
A Series is a one-dimensional array of data (similar to an array, list, or column in a table) with an associated labeled index. <br>
It can be created in much the same way as a NumPy array. <br>
Creating a Series from arrays, lists, dicts, tuples <br>
    >a. Attributes <br>
    >b. Subsetting <br>
    >c. Methods
2. Syntax: `pd.Series(data=, index=, dtype=, name=)`


In [9]:
import numpy as np
import pandas as pd
#from pandas import Series
pd.Series?

# How to create a Series?

In [13]:
#Creating a series using an Array
#1. Creating an array
x_random = np.random.randn(5)
print x_random

#2. Converting array into series
my_series = pd.Series(x_random)
print my_series

[ 0.46703475 -0.14096726 -0.59425859 -0.25579921 -0.35098542]
0    0.467035
1   -0.140967
2   -0.594259
3   -0.255799
4   -0.350985
dtype: float64


In [None]:
print dir(my_series),

In [None]:
#Series with Explicit Index
my_series_with_explicit_index = pd.Series(x_random.round(2), index=list('abcde'))
print my_series_with_explicit_index, '\n'
print my_series_with_explicit_index.values, '\n'
print my_series_with_explicit_index.index, '\n'

In [16]:
#Creating Series using a list, tuple or dict
# From a list
pd.Series([1, 2, 3], index=list('XYZ'), name='Series_1', dtype=float)

X    1.0
Y    2.0
Z    3.0
Name: Series_1, dtype: float64

In [None]:
# From a tuple
pd.Series((1, 2, 3), index=list('abc'), name='Series_1', dtype=float)

In [None]:
# From a dict
pd.Series({'a': 1, 'b': 2, 'c':3}, dtype=float, name='Series_from_dict')

In [20]:
city_dict = {'Delhi': 100, 'Nagpur': None, 'Pune': 600, 
                    'Mumbai': 700, 'Chennai': 450, 'Lucknow': None}

cities = pd.Series(city_dict)
print cities

Chennai    450.0
Delhi      100.0
Lucknow      NaN
Mumbai     700.0
Nagpur       NaN
Pune       600.0
dtype: float64


In [22]:
cities.index

Index([u'Chennai', u'Delhi', u'Lucknow', u'Mumbai', u'Nagpur', u'Pune'], dtype='object')

---
## Series Attributes

- An attribute contains METADATA about the data structure
- Attributes are accessed using the dot operator `my_series.<attr-name>`

In [None]:
print my_series
print
print type(my_series)
print
print my_series.values
print
print my_series.index
print
print type(my_series.values)
print
print type(my_series.index)
print
print my_series.shape
print
print my_series.nbytes
print

### From Series to dictionary, list

In [23]:
my_series.tolist()

[0.46703475262642785,
 -0.14096725652312447,
 -0.594258594742233,
 -0.25579920981191817,
 -0.3509854238838949]

In [None]:
my_series.to_dict()

In [None]:
my_series_with_explicit_index.tolist()

In [None]:
my_series_with_explicit_index.to_dict()

## Subsetting a Series

We can use 
- label-based indexing by passing index labels associated with the values
> - Single/list of labels <br>
> - Slice of labels <br>
> - Positional slicing <br>
> - Reversing the series <br>
- Fancy indexing using the indexers loc, iloc, ix, at, iat (used with square brackets, not called like methods)
> - .loc[] for label based subsetting <br>
> - .iloc[] for integer position based subsetting <br>
> - .at[] and .iat[] are the scalar counterparts of .loc[] and .iloc[]: they read or write a single value <br>
> - .ix[] mixes label and positional access; it is deprecated in newer pandas, so prefer .loc[] and .iloc[] <br>
- Boolean indexing for subsetting with logical arrays
> - Boolean indexing works in the same way as it does for subsetting NumPy arrays. We create a boolean Series of the same length as the original (usually by applying a condition to that same Series), and then pass it inside the square brackets <br>

This is mostly similar to NumPy array slicing, except that the returned values keep their associated index labels.
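
The differences between the indexers listed above can be sketched as follows. This is a minimal illustration, not part of the original notebook: the Series `s` and the result names are made up for the example. It shows that label slices include both endpoints, positional slices exclude the stop position, and `.at` / `.iat` return a single scalar.

```python
import pandas as pd

# Hypothetical example data, not taken from the notebook
s = pd.Series([10, 20, 30, 40, 50], index=list('abcde'))

# Label slices include BOTH endpoints: rows a, b and c
by_label = s.loc['a':'c']

# Positional slices exclude the stop position, as in plain Python: rows a and b
by_position = s.iloc[0:2]

# .at / .iat look up exactly one value, by label and by position respectively
value_at_label = s.at['d']
value_at_position = s.iat[3]

print(by_label)
print(by_position)
```

Here `value_at_label` and `value_at_position` both refer to the same element, since label `'d'` sits at position 3.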

In [None]:
my_series = pd.Series(np.random.randn(5), index = list('abcde')).round(2)
my_series

In [None]:
# One Label
print my_series['a']
print

# List of Labels
print my_series[['b','c','e']] 
print

# Label Slice
print my_series['a':'e']
print
print my_series[:'b']
print

# Positional Slicing
print my_series[0:3] 
print
print my_series[:2]
print

#Reverse the series
print my_series[::-1]
print


In [None]:
#Series slicing using the indexers loc, iloc, ix
# LABEL BASED INDEXER
#.loc[] for label based subsetting 
#.iloc[] for integer based subsetting
# my_series.loc?

#%timeit my_series.loc[['a', 'c', 'e']]
print my_series.loc[['a', 'c', 'e']] 
print

#%timeit my_series[['a', 'c', 'e']]
print my_series[['a', 'c', 'e']]
print

# INTEGER BASED INDEXED METHOD
#my_series.iloc?

print my_series.iloc[0:2]
print 

print my_series.iloc[:-3]  # positional slice: everything except the last three rows
print 

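# MIXED LABELS and INTEGERS BASED INDEXER
#.ix[] supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access
# Note: .ix was deprecated in pandas 0.20 and removed in pandas 1.0; on newer versions use .loc[] / .iloc[] instead
#my_series.ix?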

print my_series.ix['a':'c']
print 

print my_series.ix[0:2]
print 

## Boolean Series and Indexing
- Use conditional operators to create an equal-length Boolean series
- Subset the series using this boolean inside square brackets

In [27]:
my_series = pd.Series(np.random.randn(15).round(2), index=list('ABC'*5))
print my_series

A   -1.09
B    0.80
C    0.59
A    1.32
B    0.23
C   -0.49
A    1.29
B   -0.85
C    0.09
A   -0.87
B   -0.09
C    0.31
A    0.41
B    0.57
C    0.91
dtype: float64


In [None]:
# Note: this label slice raises a KeyError, because the index is neither unique nor sorted
my_series.loc['A':'C']

In [28]:
my_series > 0

A    False
B     True
C     True
A     True
B     True
C    False
A     True
B    False
C     True
A    False
B    False
C     True
A     True
B     True
C     True
dtype: bool

In [29]:
my_series[my_series > 0]

B    0.80
C    0.59
A    1.32
B    0.23
A    1.29
C    0.09
C    0.31
A    0.41
B    0.57
C    0.91
dtype: float64
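
Several conditions can be combined into one boolean Series before subsetting. The sketch below is an illustration with made-up values (the names `s`, `between`, `extreme` and `not_positive` are not from the notebook). Each comparison needs its own parentheses, and the element-wise operators `&`, `|` and `~` are used instead of `and`, `or` and `not`.

```python
import pandas as pd

# Hypothetical example data with repeated labels
s = pd.Series([-1.09, 0.80, 0.59, 1.32, -0.49], index=list('ABCAB'))

# Values strictly between 0 and 1
between = s[(s > 0) & (s < 1)]

# Values below -1 or above 1
extreme = s[(s < -1) | (s > 1)]

# Negation of a condition: values that are not positive
not_positive = s[~(s > 0)]

print(between)
print(extreme)
print(not_positive)
```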

In [None]:
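#subsetting using combination of methods
print my_series.max()
print my_series.idxmax()
# Select the biggest value in a Series by looking up its label
# (.loc replaces the deprecated .ix; this index repeats labels, so every row sharing that label is returned)
print my_series.loc[my_series.idxmax()]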

## Data Wrangling Tasks

- Peeking at the data: 
> head() and tail() show a small sample from the start or end of a Series or DataFrame object. The default number of elements to display is five, but you may pass a custom number.
- Array Operations on a Series <br>
> Array operations on a Series preserve the index-value links <br>
> Alignment in Arithmetic Operations <br>
> Series with different indexes will be automatically aligned, and NaNs induced in locations where data is not found. <br>
> The indexes are _unioned_. <br>
> Think of binary operations as outer joins.
- Series in a list/dict comprehension <br>
- Checking values belonging to a list: 
> isin produces a boolean Series by comparing each element of the Series against the provided list. An element maps to True if it belongs to the list. This boolean Series may then be used for subsetting. 
- Reindexing <br>
- Type Conversion: 
> astype explicitly converts a Series from one dtype to another <br>
- Treating Outliers: 
> clip_upper and clip_lower can be used to clip outliers at a threshold value. All values lower than the one supplied to clip_lower, or higher than the one supplied to clip_upper, are replaced by that threshold. (Both were removed in pandas 1.0; clip(lower=..., upper=...) does the same job.) <br>
> This is especially useful for treating outliers when used in conjunction with .quantile() <br>
> (Note: In data wrangling, we generally clip values at the 1st and 99th percentiles, or at the 5th and 95th percentiles) <br>
- Replacing Values: 
> replace is an effective way to replace source values with target values by supplying a dictionary with the required substitutions
- Finding uniques and their frequency: unique, nunique, value_counts 
> These methods are used to find the array of distinct values in a categorical Series, to count the number of distinct items, and to create a frequency table respectively. <br>
- Dealing with Duplicates: duplicated 
> duplicated produces a boolean Series that marks every instance of a value after its first occurrence as True. drop_duplicates returns the Series with the duplicates removed. To modify the Series in place instead of getting a new one back, pass the inplace=True argument. <br>
- Finding the largest/smallest values: idxmax, idxmin, nlargest, nsmallest 
> As their names imply, idxmax and idxmin return the index label of the largest and smallest value, while nlargest and nsmallest return the n largest and n smallest values together with their index labels, which can be especially helpful in many cases.<br>
- Sorting the data: 
> sort_values and sort_index sort a Series by values or by index. Note that both return a new sorted Series; to make the sorting permanent, pass inplace=True or assign the result back.
- Mathematical Summaries:  
> mean, median, std, quantile and describe are mathematical methods that summarize the central tendency and spread of a set of data points. quantile finds the requested percentiles, whereas describe produces the summary statistics for the data. <br> 
- Dealing with missing data: 
> isnull, notnull are complementary methods that work on a Series with missing data to produce boolean Series to identify missing or non-missing values respectively. Note that both the NumPy np.nan and the base Python None type are identified as missing values <br>
- Missing values imputation: 
> fillna, ffill, bfill and dropna: this set of Series methods lets us deal with missing data, either by imputing a particular value or by copying the last known value over the missing ones (typically used in time-series analysis). We may sometimes want to drop the missing data altogether, and dropna does that. <br>
> (Note: It is a common practice in data science to replace missing values in a numeric variable with its mean, or its median if the data is skewed, and in a categorical variable with its mode) <br>

- Apply function to each element: 
> map is one of the most useful Series methods. It takes a built-in or user-defined function (or a dict or Series used as a lookup table) and applies it to each value in the Series. Combined with Python's lambda functions, it can be a powerful tool for transforming a given Series. <br>
> It behaves much like Python's built-in map function for iterables, but returns a Series that keeps the original index <br>
- Visualizing the data: 
> The plot method is a gateway to a treasure trove of potential visualizations like histograms, bar charts, scatterplots, boxplots and more.<br>



In [33]:
my_series= pd.Series(np.random.randn(15).round(2), index=list('ABC'*5))

In [None]:
#peeking at the data
print my_series.head()
print
print my_series.tail()
print

In [None]:
#Array operations on a Series preserve the index-value links
#Methods that apply to dicts are also valid on a Series.

print np.sqrt(my_series)
print 
print my_series + my_series
print 
print my_series / 2
# The index-value linkages are preserved.
my_series2 = my_series * 2; print my_series2

In [None]:
my_series > my_series * 2

In [None]:
cities = pd.Series({'Delhi': 100, 'Nagpur': None, 'Pune': 600, 
                    'Mumbai': 700, 'Chennai': 450, 'Lucknow': None})
print cities
print '\n'
# Adding two Series together returns a union of the two Series with the addition occurring on the shared index values. 
# Values on either Series that did not have a shared index will produce a NULL/NaN (not a number).

print cities[['Chennai', 'Pune', 'Mumbai']]
print'\n'
print cities[['Pune', 'Delhi']]
print'\n'
print cities[['Pune', 'Delhi']] + cities[['Chennai', 'Pune', 'Mumbai']]
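
When the NaNs introduced by alignment are not wanted, the `add` method with `fill_value` treats a label missing from one side as that default before adding. The sketch below is an illustration only: `left` and `right` are stand-ins built from the same city numbers, not objects from the notebook.

```python
import pandas as pd

# Hypothetical stand-ins for the two subsets added above
left = pd.Series({'Pune': 600, 'Delhi': 100})
right = pd.Series({'Chennai': 450, 'Pune': 600, 'Mumbai': 700})

# Plain addition: the indexes are unioned, and labels found on only one side give NaN
plain_sum = left + right

# add() with fill_value=0: a label missing from one side counts as 0
filled_sum = left.add(right, fill_value=0)

print(plain_sum)
print(filled_sum)
```

Only `'Pune'` appears in both inputs, so it is the only label with a number in `plain_sum`, while `filled_sum` has a number for every city.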

In [None]:
#Checking values belonging to a list
#1. using in operator
my_series = pd.Series(['foo', 'bar', 'boo', 'far'])
my_series

In [None]:
print 1 in my_series.index
print 
print 'foo' in my_series.index

In [None]:
diner = pd.Series({'ham': 1, 'eggs': 3, 'bacon': 2, 'coffee': 1, 'toast': 0.5, 'jam': 0.2})

In [None]:
'pancakes' in diner # this matches only the labels not values

In [None]:
diner['pancakes'] = 5 # adds a new label; assigning to an existing label would replace its value

In [None]:
'bacon' in diner.index
# or 'bacon' in diner.keys()

In [None]:
'bacon' in diner

In [None]:
#THE .isin() method
pls = pd.Series(['c', 'py', 'java', 'scala', 'swift'])
print pls
print
print pls.isin(['a', 'b'])
print
print pls.isin(['r', 'py', 'vba', 'swift'])
print
print pd.Series([X in ['r', 'py', 'vba', 'swift'] for X in pls.values])
print
print pls[pls.isin(['java', 'py'])]

In [94]:
# comprehensions
iter(pls)
iter(pls.index)
iter(pls.values)

<iterator at 0x88245f8>

In [None]:
print diner
print
# Note: .iteritems() was removed in pandas 2.0; .items() is the equivalent there
print pd.Series({k: v + 2 for k, v in diner.iteritems() if k not in ['ham', 'jam']})
print
print [x + 2 for x in diner]
print
print diner +2

In [43]:
#Re-indexing: assigning to a label that does not exist yet enlarges the Series
my_series = pd.Series(np.random.randn(5), index = list('abcde')).round(2)
#print my_series

my_series['x'] = 10
my_series['z'] = 666
my_series

a      2.56
b     -0.03
c      0.13
d     -1.82
e      0.31
x     10.00
z    666.00
dtype: float64
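
The `reindex` method conforms a Series to a given list of labels, which is different from adding labels by assignment. The sketch below is illustrative only, with made-up data and names (`s`, `reordered`, `with_default`). It shows that existing labels keep their values, unknown labels become NaN unless `fill_value` is given, and the original Series is left unchanged.

```python
import pandas as pd

# Hypothetical example data
s = pd.Series([2.56, -0.03, 0.13], index=list('abc'))

# reindex returns a NEW Series in the requested label order;
# 'z' is not in s, so its value is NaN
reordered = s.reindex(['c', 'a', 'z'])

# fill_value supplies a default for labels that were not present
with_default = s.reindex(['c', 'a', 'z'], fill_value=0)

print(reordered)
print(with_default)
```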

In [None]:
#TYPE CONVERSION
#astype explicitly converts dtypes from one to another
my_series=pd.Series(np.random.randn(1000).round(2))
print my_series.head()
print
print my_series.astype(str).head() 
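
Conversions in the other direction are worth a quick look as well. This is a small sketch with made-up data (`digits`, `floats` and the result names are not from the notebook): strings holding digits convert cleanly to integers, while converting floats to integers drops the fractional part rather than rounding.

```python
import pandas as pd

# Strings that hold whole numbers can be converted to integers
digits = pd.Series(['1', '2', '3'])
as_int = digits.astype(int)

# Float to int truncates toward zero: 1.9 becomes 1 and -1.9 becomes -1
floats = pd.Series([1.9, -1.9, 2.0])
truncated = floats.astype(int)

print(as_int)
print(truncated)
```

If rounding is what is wanted, calling `.round()` before `.astype(int)` is one way to get it.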

In [None]:
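#Handling Outliers - Method 1: clip at fixed thresholds
# (.clip(upper=...) / .clip(lower=...) replace clip_upper / clip_lower, which were removed in pandas 1.0)
print my_series.head(10)
print my_series.head(10).clip(upper=.50) 
print my_series.head(10).clip(lower=.50) 

#Handling Outliers - Method 2: cap at the 99th percentile, floor at the 1st percentile
print my_series.head(10)
print my_series.head(10).clip(upper=my_series.quantile(0.99)) 
print my_series.head(10).clip(lower=my_series.quantile(0.01)) 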

In [None]:
#Replace value
fruits = pd.Series(['apples', 'oranges', 'peaches', 'mangoes']) 
print fruits
fruits.replace({'apples':'grapes', 'peaches':'bananas'}) 

In [None]:
#Detect Missing Values
#Missing values appear as NaN. The functions isnull and notnull are used to detect missing values.
#They both produce boolean Series that can be used for subsetting

my_series=pd.Series([1.12, 3.14, np.nan, 6.02, 2.73, None])
print my_series
print
print my_series.isnull()
print
print my_series.notnull()

In [None]:
my_series2 = my_series[my_series.notnull()]

In [None]:
my_series[my_series.isnull()]

In [None]:
my_series.fillna(0)

In [47]:
my_series.mean()

3.2525

In [48]:
my_series.fillna(my_series.mean())

0    1.1200
1    3.1400
2    3.2525
3    6.0200
4    2.7300
5    3.2525
dtype: float64

In [49]:
my_series.fillna(method='ffill')

0    1.12
1    3.14
2    3.14
3    6.02
4    2.73
5    2.73
dtype: float64

In [None]:
my_series.fillna(method='bfill')
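
The remaining options for missing data can be sketched on the same values. This is an illustration with its own variable names (`s`, `dropped`, `forward`, `backward`, `median_filled`), not a continuation of the notebook's session. `ffill()` and `bfill()` are the method forms of the two `fillna(method=...)` calls, and `dropna()` removes the missing entries altogether.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.12, 3.14, np.nan, 6.02, 2.73, None])

# Drop the missing entries; the surviving rows keep their original labels
dropped = s.dropna()

# Forward fill: copy the last known value over each gap
forward = s.ffill()

# Backward fill: copy the next known value; the final gap stays NaN
# because nothing follows it
backward = s.bfill()

# Impute with the median, which is less sensitive to skew than the mean
median_filled = s.fillna(s.median())

print(dropped)
print(backward)
print(median_filled)
```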

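## Difference between NA and NaN

- NaN ("not a number") is a floating-point value, np.nan, produced by undefined numeric operations
- NA stands for missing data in general; pandas treats both np.nan and Python's None as missing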

In [None]:
print type(np.nan)
print type(None)
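
How the two kinds of missing marker behave inside a Series can be sketched as follows. This is an illustration with made-up names (`numeric`, `text`, `nan_equals_itself`); the object-dtype Series is created with an explicit `dtype=object` so that the example does not depend on the default string handling of a particular pandas version.

```python
import numpy as np
import pandas as pd

# NaN never compares equal to anything, not even to itself
nan_equals_itself = (np.nan == np.nan)

# In a numeric Series, None is converted to NaN and the dtype becomes float64
numeric = pd.Series([1, None, 3])

# In an object Series, None is stored as-is but still counts as missing
text = pd.Series(['a', None, 'c'], dtype=object)

print(nan_equals_itself)
print(numeric)
print(text.isnull())
```

Because NaN does not equal itself, missing values should be detected with `isnull` / `notnull` rather than with `==`.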

In [None]:
#Finding uniques and their frequency
my_series = pd.Series(list('abcd' * 3)) 
print my_series
print
print my_series.unique()
print
#print np.array(['a', 'b', 'c', 'd'], dtype=object)
print
print my_series.nunique()
print
print my_series.value_counts() 


In [None]:
#Handling duplicates
my_series.duplicated() 

In [None]:
my_series.drop_duplicates() 
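
Both methods accept a `keep` argument that controls which occurrence counts as the original. The sketch below is an illustration with made-up data and names (`s`, `first_kept`, `last_kept`, `all_marked`, `unique_only`).

```python
import pandas as pd

# Hypothetical example data: 'a' appears twice
s = pd.Series(list('abca'))

# Default keep='first': only the later 'a' is flagged
first_kept = s.duplicated()

# keep='last': the earlier 'a' is flagged instead
last_kept = s.duplicated(keep='last')

# keep=False: every member of a duplicated group is flagged
all_marked = s.duplicated(keep=False)

# Dropping with keep=False leaves only values that occur exactly once
unique_only = s.drop_duplicates(keep=False)

print(all_marked)
print(unique_only)
```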

In [None]:
#Finding the largest & smallest values
my_series =  pd.Series(np.random.randint(0, 50, 6), index=list('xyzabc'))
print my_series
print
print my_series.idxmax()
print
print my_series.idxmin()
print
print my_series.nlargest(3)
print
print my_series.nsmallest(3)

In [None]:
#Sorting the data
my_series.sort_values()

In [None]:
my_series.sort_index()
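
Both sorting methods return a new Series and take an `ascending` flag. A minimal sketch, with made-up data and names (`s`, `descending`, `by_label_desc`), which also shows that the original Series is left untouched:

```python
import pandas as pd

# Hypothetical example data
s = pd.Series([30, 10, 20], index=list('xyz'))

# Largest value first
descending = s.sort_values(ascending=False)

# Labels in reverse alphabetical order
by_label_desc = s.sort_index(ascending=False)

print(descending)
print(by_label_desc)
print(s)  # still in its original order
```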

In [None]:
#Mathematical summaries
my_series=pd.Series(np.random.randn(100))
print my_series.head(5)
print
print my_series.mean()
print
print my_series.std()
print
print my_series.sum()
print
print my_series.count()
print
print my_series.median()
print
print my_series.quantile([0.10, 0.25,0.5,0.75,0.9,0.99])
print
print my_series.describe()

In [None]:
#Handling Missing data
my_series=pd.Series([1.12, 3.14, np.nan, 6.02, 2.73, None])
print my_series
print my_series.isnull()
print my_series.notnull()

In [None]:
# Let's say we have a list of names stored in a Series
my_series = pd.Series(['Dave Smith', 'Jane Doe', 'Carl James', 'Jim Hunt'])
print my_series

# Find the length of each name 
print my_series.map(lambda x: len(x)) 

#Find the initials

print my_series.map(lambda x: '.'.join([i[0] for i in x.split(' ')])) 
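
Besides functions, `map` also accepts a dict, which then acts as a lookup table. The sketch below is an illustration with made-up data (`langs`, `full_names` and the result names are not from the notebook); values without an entry in the dict come back as NaN.

```python
import pandas as pd

# Hypothetical example data
langs = pd.Series(['py', 'r', 'java', 'py'])
full_names = {'py': 'Python', 'r': 'R'}

# Dict lookup: 'java' has no entry, so its result is NaN
mapped = langs.map(full_names)

# Any one-argument callable works too, including built-in methods
upper = langs.map(str.upper)

print(mapped)
print(upper)
```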


In [None]:
#Visualize the data
# Create a categorical series 
my_series =  pd.Series(list('a' * 3) + list('b' * 5) + list('c' * 9) + list('d' * 2))
my_series.head()


In [None]:
my_series.value_counts().plot.bar() 

In [None]:
# Create a numerical series
my_series=pd.Series(np.random.randn(1000))
print my_series
my_series.plot.hist()
