<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Pandas-Series" data-toc-modified-id="Pandas-Series-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Pandas Series</a></span></li><li><span><a href="#Series-Methods" data-toc-modified-id="Series-Methods-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Series Methods</a></span></li><li><span><a href="#Pandas-DataFrames" data-toc-modified-id="Pandas-DataFrames-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Pandas DataFrames</a></span></li><li><span><a href="#Important-Pands-Functions-(comparison--with-R)" data-toc-modified-id="Important-Pands-Functions-(comparison--with-R)-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Important Pands Functions (comparison  with R)</a></span></li><li><span><a href="#Importing-Other-data-format-into-Python" data-toc-modified-id="Importing-Other-data-format-into-Python-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Importing Other data format into Python</a></span></li><li><span><a href="#Adding-a-series-to-data-frames" data-toc-modified-id="Adding-a-series-to-data-frames-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Adding a series to data frames</a></span></li><li><span><a href="#Loading,-Subsetting,-and-Filtering-Data-with-pandas" data-toc-modified-id="Loading,-Subsetting,-and-Filtering-Data-with-pandas-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Loading, Subsetting, and Filtering Data with pandas</a></span></li></ul></div>

In [2]:
import numpy as np
import pandas as pd
import random

In [2]:
data = pd.read_csv("data/income.csv")

In [6]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 16 columns):
Index    51 non-null object
State    51 non-null object
Y2002    51 non-null int64
Y2003    51 non-null int64
Y2004    51 non-null int64
Y2005    51 non-null int64
Y2006    51 non-null int64
Y2007    51 non-null int64
Y2008    51 non-null int64
Y2009    51 non-null int64
Y2010    51 non-null int64
Y2011    51 non-null int64
Y2012    51 non-null int64
Y2013    51 non-null int64
Y2014    51 non-null int64
Y2015    51 non-null int64
dtypes: int64(14), object(2)
memory usage: 6.5+ KB


In [7]:
data.columns

Index(['Index', 'State', 'Y2002', 'Y2003', 'Y2004', 'Y2005', 'Y2006', 'Y2007',
       'Y2008', 'Y2009', 'Y2010', 'Y2011', 'Y2012', 'Y2013', 'Y2014', 'Y2015'],
      dtype='object')

## Pandas Series

The Series is the primary building block of pandas. A Series is a one-dimensional labeled array built on the NumPy ndarray. Like an array, a Series can hold zero or more values of any single data type.

* The Series object is basically interchangeable with a one-dimensional NumPy array.
* The essential difference is the index: a NumPy array has an implicitly defined integer index used to access its values, while a pandas Series has an explicitly defined index associated with its values.
* This explicit index gives the Series object additional capabilities.
* For example, the index need not be an integer; it can consist of values of any desired type.
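
The difference between an implicit and an explicit index can be sketched as follows. The values and the labels `'x'`, `'y'`, `'z'` are illustrative assumptions, not data from this notebook.

```python
import numpy as np
import pandas as pd

values = [10, 20, 30]

# A NumPy array only has implicit integer positions 0, 1, 2.
arr = np.array(values)
print(arr[1])            # second element, by position

# A Series carries an explicit index, which need not be integers.
ser = pd.Series(values, index=['x', 'y', 'z'])
print(ser['y'])          # the same element, looked up by label
print(ser.index)         # the explicit index object
print(ser.to_numpy())    # the underlying values as a NumPy array
```
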

Creating Series From List

In [2]:
ser = pd.Series([1, 3, 5, 7])
print(ser)

0    1
1    3
2    5
3    7
dtype: int64


Use a Dict to Initialize Series

In [3]:
prices = {'apple': 4.99,
          'banana': 1.99,
          'orange': 3.99,
          'grapes': 0.99}
ser = pd.Series(prices)
print(ser)

apple     4.99
banana    1.99
grapes    0.99
orange    3.99
dtype: float64


Initialize Series from Scalar

In [4]:
ser = pd.Series(2, index=range(0, 5))
print(ser)

0    2
1    2
2    2
3    2
4    2
dtype: int64


A Series of Odd Numbers

In [6]:
print (pd.Series(range(1, 10, 2)))


0    1
1    3
2    5
3    7
4    9
dtype: int64


With an Alphabetic Index

In [7]:
print (pd.Series(range(1, 15, 3), index=[x for x in 'abcde']))

a     1
b     4
c     7
d    10
e    13
dtype: int64


A Series With Random Numbers

In [11]:
import random
print(pd.Series(random.sample(range(100), 6)))

0    75
1    45
2    96
3    82
4    74
5    34
dtype: int64


Combining Lists

In [13]:
# Pair each letter with a number, then build the Series from the dict
x = dict(zip('abcdefg', range(1, 8)))
print(x)
y = pd.Series(x)
print(y)

{'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6, 'g': 7}
a    1
b    2
c    3
d    4
e    5
f    6
g    7
dtype: int64


Specifying an Index

In [14]:
print (pd.Series(range(1,8), index=[x for x in 'abcdefg']))

a    1
b    2
c    3
d    4
e    5
f    6
g    7
dtype: int64


Naming a Series

In [16]:
a = [1, 3, 5, 7]
print (pd.Series(a, name='joe'))

0    1
1    3
2    5
3    7
Name: joe, dtype: int64


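Comparing List with Series

In [23]:
ser = pd.Series(random.sample(range(100), 10))
print(ser)
# ser[4] looks up the label 4; with the default index that is the fifth element
print('Element at index 4: ', ser[4])
# An integer slice is positional and excludes the end point, as with a list
print('Slice: ', ser[3:8])


0    10
1    26
2    88
3    81
4    53
5    92
6    83
7    65
8    32
9    11
dtype: int64
Element at index 4:  53
Slice:  3    81
4    53
5    92
6    83
7    65
dtype: int64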
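
To make the comparison with a plain list explicit, here is a minimal sketch. The values and the alphabetic labels are assumptions chosen for illustration; `.loc` and `.iloc` are the standard pandas accessors for label-based and position-based lookup.

```python
import pandas as pd

lst = [10, 26, 88, 81, 53]
ser = pd.Series(lst, index=list('abcde'))

# A list is addressed by position only.
print(lst[4])

# A Series can be addressed by label (.loc) or by position (.iloc).
print(ser.loc['e'])
print(ser.iloc[4])

# Positional slices exclude the end point, as with a list...
print(ser.iloc[1:3].tolist())
# ...while label slices include it.
print(ser.loc['b':'d'].tolist())
```
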


## Series Methods

In [5]:
a = pd.Series(random.sample(range(100), 6))
# shape gives the dimensions of the Series as a tuple
print(a.shape)

(6,)


In [6]:
print(a.count())  # number of non-NaN elements, as an integer

6


**`count()` reports only the number of non-NaN elements, while `shape` counts every element, NaN included.**
Another attribute for getting the number of elements is `size`. It reports the count as an integer and also includes any NaN elements.

In [9]:
a = pd.Series(random.sample(range(100), 6))
print ('count of a =>', a.count(), '\n')
# pd.concat() replaces Series.append(), which was removed in pandas 2.0
b = pd.concat([a, pd.Series(np.nan, index=list('abcd'))], ignore_index=True)
print ('b => ', b, '\n')
print ('count of b =>', b.count(), '\n')
print ('shape of b =>', b.shape, '\n')
print ('size of b =>', b.size)

count of a => 6 

b =>  0    14.0
1    72.0
2    79.0
3     0.0
4    83.0
5    30.0
6     NaN
7     NaN
8     NaN
9     NaN
dtype: float64 

count of b => 6 

shape of b => (10,) 

size of b => 10


**Get summary statistics for the Series using `describe()`. This method returns a Series whose index holds the names of the statistics, as shown.**



In [10]:
x = pd.Series(random.sample(range(100), 6))
x.describe()

count     6.000000
mean     74.333333
std      22.303961
min      47.000000
25%      56.250000
50%      75.000000
75%      93.750000
max      99.000000
dtype: float64
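
Because `describe()` returns a Series, individual statistics can be looked up by label. A minimal sketch with illustrative values (not the random sample used above):

```python
import pandas as pd

x = pd.Series([2, 4, 6, 8])
stats = x.describe()

# The result is itself a Series, indexed by statistic name.
print(type(stats).__name__)
print(stats['mean'])
print(stats[['min', 'max']])
```
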

**Show the first 5 or last 5 rows of the Series using `head()` or `tail()`**

In [12]:
x = pd.Series(random.sample(range(100), 10))
print (x, '\n')
print (x.head(), '\n')
print (x.tail(), '\n')

0    78
1    28
2     9
3    79
4    85
5    73
6    83
7    66
8    44
9    36
dtype: int64 

0    78
1    28
2     9
3    79
4    85
dtype: int64 

5    73
6    83
7    66
8    44
9    36
dtype: int64 



**Adding elements to a Series is done with `pd.concat()`, which takes a list (or tuple) of Series objects. The older `Series.append()` method was deprecated in pandas 1.4 and removed in pandas 2.0.**


In [13]:
x = pd.Series(random.sample(range(100), 6))
print(x, '\n')
print('appended =>\n', pd.concat([x, pd.Series(2), pd.Series([3, 4, 5])]))

0    30
1    93
2    83
3    98
4     3
5     0
dtype: int64 

appended =>
 0    30
1    93
2    83
3    98
4     3
5     0
0     2
0     3
1     4
2     5
dtype: int64


In [14]:
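# pd.concat() replaces Series.append(), which was removed in pandas 2.0;
# ignore_index=True renumbers the result from 0
print('appended =>\n', pd.concat([x, pd.Series(2), pd.Series([3, 4, 5])],
                                 ignore_index=True))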


appended =>
 0    30
1    93
2    83
3    98
4     3
5     0
6     2
7     3
8     4
9     5
dtype: int64


**You can delete elements from a Series by label using `drop()`, passing either a single label or a list of labels.**

In [16]:
x = pd.Series(random.sample(range(100), 6), index=list('ABCDEF'))
print (x, '\n')
print ('drop one =>\n', x.drop('C'), '\n')
print ('drop many =>\n', x.drop(['C', 'D']))

A    52
B    19
C     8
D    85
E    21
F    90
dtype: int64 

drop one =>
 A    52
B    19
D    85
E    21
F    90
dtype: int64 

drop many =>
 A    52
B    19
E    21
F    90
dtype: int64


**Get rid of duplicate elements by invoking `drop_duplicates()`**

In [17]:
x = pd.Series([1, 2, 2, 4, 5, 7, 3, 4])
print (x, '\n')
print ('drop duplicates =>\n', x.drop_duplicates(), '\n')

0    1
1    2
2    2
3    4
4    5
5    7
6    3
7    4
dtype: int64 

drop duplicates =>
 0    1
1    2
3    4
4    5
5    7
6    3
dtype: int64 



By default, the method retains the first occurrence of each repeated value. Get rid of all duplicated values (including the first occurrence) by specifying `keep=False`.

In [18]:
x = pd.Series([1, 2, 2, 4, 5, 7, 3, 4])
print (x, '\n')
print ('drop duplicates =>\n', x.drop_duplicates(keep=False), '\n')

0    1
1    2
2    2
3    4
4    5
5    7
6    3
7    4
dtype: int64 

drop duplicates =>
 0    1
4    5
5    7
6    3
dtype: int64 
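
The `keep` argument also accepts `'last'`, and `duplicated()` exposes the boolean mask behind the operation. A short sketch on the same values:

```python
import pandas as pd

x = pd.Series([1, 2, 2, 4, 5, 7, 3, 4])

# keep='last' retains the final occurrence of each repeated value instead.
last_kept = x.drop_duplicates(keep='last')
print(last_kept)

# duplicated(keep=False) marks every element that has a twin somewhere.
mask = x.duplicated(keep=False)
print(mask)
```
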



**Use `dropna()` to drop elements without a value (NaN).**


In [19]:
x = pd.Series([1, 2, 3, 4, np.nan, 5, 6])
print (x, '\n')
print ('drop na =>\n', x.dropna())

0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
5    5.0
6    6.0
dtype: float64 

drop na =>
 0    1.0
1    2.0
2    3.0
3    4.0
5    5.0
6    6.0
dtype: float64


**When you want to replace NaN elements in a Series, use `fillna()`.**

In [20]:
x = pd.Series([1, 2, 3, 4, np.nan, 5, 6])
print (x, '\n')
print ('fillna w/0 =>\n', x.fillna(0))

0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
5    5.0
6    6.0
dtype: float64 

fillna w/0 =>
 0    1.0
1    2.0
2    3.0
3    4.0
4    0.0
5    5.0
6    6.0
dtype: float64


**Use the `between()` method, which returns a Series of boolean values indicating whether
each element lies within the range (both end points are included by default).**

In [21]:
a = pd.Series(random.sample(range(100), 10))
print (a)
print (a.between(30, 50))

0    33
1    26
2    99
3    87
4    52
5    29
6    90
7    75
8    31
9    32
dtype: int64
0     True
1    False
2    False
3    False
4    False
5    False
6    False
7    False
8     True
9     True
dtype: bool


In [22]:
# You can use the returned boolean Series as a mask on the original Series.
print (a[a.between(30, 50)])

0    33
8    31
9    32
dtype: int64
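
As a sanity check on what `between()` computes, the sketch below compares it with a pair of explicit comparisons. The values are illustrative and chosen so that both end points appear.

```python
import pandas as pd

a = pd.Series([33, 26, 99, 30, 50, 29])

# between(30, 50) includes both end points by default...
mask = a.between(30, 50)

# ...so it matches this pair of explicit comparisons.
same_mask = (a >= 30) & (a <= 50)

print(mask.equals(same_mask))
print(a[mask])
```
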


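**Select elements with a predicate function by passing a callable to `.loc[]`. (The older `select()` method was removed in pandas 1.0.)**


In [23]:
x = pd.Series(random.sample(range(100), 6))
print(x, '\n')
# The callable receives the Series and returns a boolean mask
print('select func =>\n', x.loc[lambda s: s > 20])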

0    25
1    60
2    72
3     2
4    34
5     0
dtype: int64 

select func =>
 0    25
1    60
2    72
4    34
dtype: int64


**Use `filter(items=[..])` with the labels to be selected in a list.**

In [24]:
x = pd.Series([1, 2, 3, 4, np.nan, 5, 6])
print (x, '\n')
print ('filtered =>\n', x.filter(items=[1, 2, 6]))

0    1.0
1    2.0
2    3.0
3    4.0
4    NaN
5    5.0
6    6.0
dtype: float64 

filtered =>
 1    2.0
2    3.0
6    6.0
dtype: float64


**Select labels to filter using a regular expression match with `filter(regex='..')`.**

In [25]:
x = pd.Series({'apple': 1.99,
               'orange': 2.49,
               'banana': 0.99,
               'grapes': 1.49,
               'melon': 3.99})
print(x, '\n')
print('regex filter =>\n', x.filter(regex='a'))

apple     1.99
banana    0.99
grapes    1.49
melon     3.99
orange    2.49
dtype: float64 

regex filter =>
 apple     1.99
banana    0.99
grapes    1.49
orange    2.49
dtype: float64


**Use the `filter(like='..')` version to perform a substring match on the labels to be selected.**



In [35]:
print ('like filter =>\n', x.filter(like='an'))

like filter =>
 banana    0.99
orange    2.49
dtype: float64


**You can sort a Series by its values with `sort_values()`.**

In [39]:
print (x, '\n')
print ('sort by value: =>\n', x.sort_values())

apple     1.99
banana    0.99
grapes    1.49
melon     3.99
orange    2.49
dtype: float64 

sort by value: =>
 banana    0.99
grapes    1.49
apple     1.99
orange    2.49
melon     3.99
dtype: float64
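
Two common variations, sketched under the assumption that a descending order or a label order is wanted: `ascending=False` reverses the value sort, and `sort_index()` sorts by label instead. The three prices are a subset chosen for illustration.

```python
import pandas as pd

x = pd.Series({'apple': 1.99, 'orange': 2.49, 'banana': 0.99})

# Largest value first.
by_value_desc = x.sort_values(ascending=False)
print(by_value_desc)

# Sort by label rather than by value.
by_label = x.sort_index()
print(by_label)
```
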


## Pandas DataFrames

## Important Pandas Functions (comparison with R)

![Imgur](https://i.imgur.com/Wsh9hfp.png)
<sub>source: <a href="https://www.listendata.com/2017/05/python-data-science.html#pandas_functions" target="_blank">https://www.listendata.com/2017/05/python-data-science.html#pandas_functions</a></sub>  


## Importing Other data format into Python

https://www.listendata.com/2017/02/import-data-in-python.html

In [9]:
mydata = pd.read_table("data/Download data.txt")

In [10]:
mydata.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 1 columns):
*&---------------------------------------------------------------------*    84 non-null object
dtypes: object(1)
memory usage: 752.0+ bytes


In [12]:
mydata1  = pd.read_csv("data/Download data.txt", sep ="\t")


In [13]:
mydata1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 1 columns):
*&---------------------------------------------------------------------*    84 non-null object
dtypes: object(1)
memory usage: 752.0+ bytes


In [11]:
mydata.columns

Index(['*&---------------------------------------------------------------------*'], dtype='object')
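
A single column in the result usually means the separator did not match the file's layout. The sketch below uses a small in-memory string (a hypothetical stand-in for a real tab-separated file) to show that `read_table()` and `read_csv(..., sep="\t")` parse identically when the data really is tab-separated.

```python
import io
import pandas as pd

# Hypothetical tab-separated text standing in for a real file.
text = "name\tprice\napple\t1.99\nbanana\t0.99\n"

# read_table() splits on tabs by default.
df_table = pd.read_table(io.StringIO(text))

# read_csv() splits on commas by default, so the tab is named explicitly.
df_csv = pd.read_csv(io.StringIO(text), sep="\t")

print(df_table.equals(df_csv))
print(df_table.columns.tolist())
```
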

In [3]:
# The keyword is sheet_name; the older spelling sheetname is no longer accepted
mydata3 = pd.read_excel("https://www.eia.gov/dnav/pet/hist_xls/RBRTEd.xls", sheet_name="Data 1", skiprows=2)

In [4]:
mydata3.shape

(7952, 2)

In [5]:
mydata3.head()

| | Date | Europe Brent Spot Price FOB (Dollars per Barrel) |
|---|---|---|
| 0 | 1987-05-20 | 18.63 |
| 1 | 1987-05-21 | 18.45 |
| 2 | 1987-05-22 | 18.55 |
| 3 | 1987-05-25 | 18.6 |
| 4 | 1987-05-26 | 18.63 |


In [23]:
mydata3.dtypes

Date                                                datetime64[ns]
Europe Brent Spot Price FOB (Dollars per Barrel)           float64
dtype: object

In [25]:
# Only the last expression in a cell is displayed, so this cell shows the minimum
mydata3['Europe Brent Spot Price FOB (Dollars per Barrel)'].max()
mydata3['Europe Brent Spot Price FOB (Dollars per Barrel)'].min()


9.1

In [26]:
mydata3.describe()

| | Europe Brent Spot Price FOB (Dollars per Barrel) |
|---|---|
| count | 7903.0 |
| mean | 45.498118 |
| std | 33.038214 |
| min | 9.1 |
| 25% | 18.6 |
| 50% | 29.62 |
| 75% | 65.945 |
| max | 143.95 |


**Exporting a DataFrame to a CSV file on the local drive**

In [27]:
mydata3.to_csv("data/mydata3.csv")
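
By default `to_csv()` also writes the row index as an unnamed first column. A minimal sketch of the difference `index=False` makes, using a small made-up frame and an in-memory buffer instead of a file on disk (both are assumptions for illustration):

```python
import io
import pandas as pd

df = pd.DataFrame({'Date': ['1987-05-20', '1987-05-21'],
                   'Price': [18.63, 18.45]})

# Default: the row index is written as an unnamed first column.
buf = io.StringIO()
df.to_csv(buf)
with_index = buf.getvalue()
print(with_index)

# index=False leaves the index out of the file.
buf = io.StringIO()
df.to_csv(buf, index=False)
without_index = buf.getvalue()
print(without_index)

# Reading the second version back gives only the original columns.
restored = pd.read_csv(io.StringIO(without_index))
print(restored.columns.tolist())
```
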

## Adding a series to data frames

In [8]:
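# The Series is aligned to the DataFrame on its index labels (0, 4 and 6);
# rows with no matching label receive NaN
ser = pd.Series(["one", "two", "three"], index=[0, 4, 6])
mydata3["ser"] = ser
mydata3.head(15)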

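| | Date | Europe Brent Spot Price FOB (Dollars per Barrel) | ser |
|---|---|---|---|
| 0 | 1987-05-20 | 18.63 | one |
| 1 | 1987-05-21 | 18.45 | NaN |
| 2 | 1987-05-22 | 18.55 | NaN |
| 3 | 1987-05-25 | 18.6 | NaN |
| 4 | 1987-05-26 | 18.63 | two |
| 5 | 1987-05-27 | 18.6 | NaN |
| 6 | 1987-05-28 | 18.6 | three |
| 7 | 1987-05-29 | 18.58 | NaN |
| 8 | 1987-06-01 | 18.65 | NaN |
| 9 | 1987-06-02 | 18.68 | NaN |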
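
The alignment behaviour can be reproduced on a small frame. In this sketch the column name `price`, the values, and the label `9` are illustrative assumptions; the point is that assignment matches on index labels rather than on position.

```python
import pandas as pd

df = pd.DataFrame({'price': [18.63, 18.45, 18.55, 18.60]})

# Labels 0 and 2 exist in df; label 9 does not.
ser = pd.Series(['one', 'two', 'three'], index=[0, 2, 9])

# Assignment aligns on the index: matched rows get the value,
# unmatched rows get NaN, and the unmatched label 9 is dropped.
df['ser'] = ser
print(df)
```
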


## Loading, Subsetting, and Filtering Data with pandas

https://towardsdatascience.com/data-science-with-python-intro-to-loading-and-subsetting-data-with-pandas-9f26895ddd7f  
https://medium.com/dunder-data/selecting-subsets-of-data-in-pandas-6fcd0170be9c  

**Notebooks**  
https://github.com/tdpetrou/Learn-Pandas/tree/master/Learn-Pandas/Selecting%20Subsets  

**sample datasets link for practice**  
https://vincentarelbundock.github.io/Rdatasets/datasets.html

<span style="color:red; font-family:brandon">Further  Resources</span>  
<a href="https://github.com/manujeevanprakash/Pandas-basics/blob/master/Pandas.ipynb" target="_blank">https://github.com/manujeevanprakash/Pandas-basics/blob/master/Pandas.ipynb</a>  
<a href="https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python" target="_blank">https://www.datacamp.com/community/tutorials/pandas-tutorial-dataframe-python</a>  
<a href="https://www.dataschool.io/best-practices-with-pandas/" target="_blank">https://www.dataschool.io/best-practices-with-pandas/</a>  

<span style="color:red; font-family:Comic Sans MS">References</span>  
<a href="https://www.listendata.com/2017/05/python-data-science.html" target="_blank">https://www.listendata.com/2017/05/python-data-science.html</a>  
<a href="https://jakevdp.github.io/PythonDataScienceHandbook/" target="_blank">https://jakevdp.github.io/PythonDataScienceHandbook/</a>  