# Pandas
"*pandas* is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language." - **https://pandas.pydata.org/**

*pandas* provides a number of useful functions for importing and analyzing data. It offers flexible data structures and operations that make data easy to manipulate. *pandas* requires *NumPy*.

This tutorial demonstrates how to use pandas to create and manipulate data structures and to read data from csv or txt files.

To use *pandas* we must first load it into our workspace using the following statement:
```
import pandas as pd
```

In [2]:
# Import the `pandas` library as `pd`
import pandas as pd
#print version of pandas
pd.__version__


'0.23.0'

## Series
A *series* is a data structure that can hold a number of objects. It is like a one-dimensional array in which every element has an index label.

In [3]:
#create series with pandas
ser1 = pd.Series([1, 2, 3])
print(ser1)
serA = pd.Series(['a', 'b', 'c'])
print(serA)

ser1A = pd.Series(['a', 'b', 'c',22.0])
print(ser1A)
print(ser1A[1])
print(ser1A[3]*2)

print(ser1A.index)
print(ser1A.values)

#series from dict
dict1 = {'a' : 2, 'b' : 1, 'c' : 3}
serD=pd.Series(dict1)
print(serD)
#positional access; integer keys on a label index are deprecated in newer pandas
print(serD.iloc[0])
print(serD['c'])

print(serD.index)
print(serD.values)

#define your own index
cars = ['NSX', 'R8', 'chiron', '488 GTB']
mpg = [22, 22, 14, 22]
serCars = pd.Series(mpg, index=cars)
print(serCars)
print(serCars[['NSX','chiron']])

0    1
1    2
2    3
dtype: int64
0    a
1    b
2    c
dtype: object
0     a
1     b
2     c
3    22
dtype: object
b
44.0
RangeIndex(start=0, stop=4, step=1)
['a' 'b' 'c' 22.0]
a    2
b    1
c    3
dtype: int64
2
3
Index(['a', 'b', 'c'], dtype='object')
[2 1 3]
NSX        22
R8         22
chiron     14
488 GTB    22
dtype: int64
NSX       22
chiron    14
dtype: int64
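The cell above reads series elements both by label and by position. A minimal sketch of making that choice explicit with `.loc` (labels) and `.iloc` (positions) follows; the variable names `ser`, `by_label`, `by_position` and `subset` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Same data as serD above: a Series with string labels
ser = pd.Series({'a': 2, 'b': 1, 'c': 3})

by_label = ser.loc['c']        # label-based lookup
by_position = ser.iloc[0]      # position-based lookup (first element)
subset = ser.loc[['a', 'c']]   # a list of labels returns a Series

print(by_label, by_position)
print(subset)
```

Using `.loc` and `.iloc` avoids ambiguity when the index itself contains integers.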


## Dataframe
A *dataframe* holds tabular data in rows and columns. It is logically the same as an Excel sheet. Each column in a dataframe is a series.

In [4]:
#data frame from dict example
d = {'Col1': pd.Series([1., 2., 3.], index=['1', 'b', 'c']),
     'Col2': pd.Series([2., 9., 4.], index=['a', 'b', 'c'])}
df1 = pd.DataFrame(d)
print(df1)

#data frame from series
df2=pd.concat([serA,ser1A], axis=1)
print(df2)

#using lists
states=['AZ','CA','IA','KS','NY']
statesFull=['Arizona','California','Iowa','Kansas','New York']
dfStates=pd.DataFrame(list(zip(states,statesFull)))
print(dfStates)
#change column names
print("After changing col names")
dfStates.columns = ['Abb', 'Name']
print(dfStates)
#see dimensions
print(dfStates.shape)



   Col1  Col2
1   1.0   NaN
a   NaN   2.0
b   2.0   9.0
c   3.0   4.0
     0   1
0    a   a
1    b   b
2    c   c
3  NaN  22
    0           1
0  AZ     Arizona
1  CA  California
2  IA        Iowa
3  KS      Kansas
4  NY    New York
After changing col names
  Abb        Name
0  AZ     Arizona
1  CA  California
2  IA        Iowa
3  KS      Kansas
4  NY    New York
(5, 2)
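The `NaN` entries in the first dataframe above come from index alignment: when a dataframe is built from series, rows are matched by index label, and a label missing from one series leaves a gap in that column. The sketch below reuses the same values; `col1`, `col2`, `frame`, `is_series`, `missing` and `complete` are illustrative names.

```python
import pandas as pd

# Two Series whose index labels only partly overlap ('b' and 'c')
col1 = pd.Series([1.0, 2.0, 3.0], index=['1', 'b', 'c'])
col2 = pd.Series([2.0, 9.0, 4.0], index=['a', 'b', 'c'])
frame = pd.DataFrame({'Col1': col1, 'Col2': col2})

# Each column of a DataFrame is itself a Series
is_series = isinstance(frame['Col1'], pd.Series)

# Count the gaps that alignment introduced in each column
missing = frame.isna().sum()

# Keep only the rows where every column has a value
complete = frame.dropna()

print(frame)
print(missing)
print(complete)
```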


## Import data
Import .csv data from storage into the Python workspace with `pd.read_csv`, then use pandas functions and tools to explore the dataset.
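If the data file is not at hand, `pd.read_csv` can also read from any file-like object. The sketch below parses a small in-memory CSV; the three data rows are copied from the iris output shown in this section, and the header line is an assumption based on the column names printed there. The name `sample` is illustrative.

```python
import io

import pandas as pd

# A few iris rows as CSV text (header assumed to match the printed columns)
csv_text = """sepallength,sepalwidth,petallength,petalwidth,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)
print(sample.dtypes)
```

For tab-delimited txt files, the same function takes a `sep` argument, for example `sep='\t'`.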

In [71]:
df = pd.read_csv("data/iris.data.csv")
#df=pd.read_csv("https://raw.githubusercontent.com/urmi-21/python3-dataScience18/master/data/iris.data.csv")
#see df dimensions
print(df.shape)
#see datatypes
print(df.dtypes)
#print first 5 rows
print(df.head(5))

#get data summary
print("Data summary")
print (df.describe())

#print unique class values
print(df['class'].unique())

#get mean sepallength
print("mean sepallength "+str(df['sepallength'].mean()))
#same column, selected by position: df.columns[0] is the first column name
print("mean first col "+str(df[df.columns[0]].mean()))
#find mean of sepallength by class
print("mean of sepallength by class: "+str(df.groupby('class')['sepallength'].mean()))
#find mean of every numeric column for each class
print("means by class\n"+str(df.groupby('class').mean()))

#apply function
import math
def logX(x):
    return math.log(x)
df['logsepallength'] = df['sepallength'].apply(logX)
df['logpetallength'] = df['petallength'].apply(lambda x: math.log(x))
#add 10 to numbers, leave strings unchanged (isinstance(x,object) is always True)
add10=lambda x: x if isinstance(x,str) else x+10
print("add10 result "+str(add10(10.0)))
print(df.head(5))

#add 10 to all values
df2=df.drop('class', axis=1).apply(lambda x: x+10 )
df2['class']=df['class']
print(df2.head(5))




(150, 5)
sepallength    float64
sepalwidth     float64
petallength    float64
petalwidth     float64
class           object
dtype: object
   sepallength  sepalwidth  petallength  petalwidth        class
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa
Data summary
       sepallength  sepalwidth  petallength  petalwidth
count   150.000000  150.000000   150.000000  150.000000
mean      5.843333    3.054000     3.758667    1.198667
std       0.828066    0.433594     1.764420    0.763161
min       4.300000    2.000000     1.000000    0.100000
25%       5.100000    2.800000     1.600000    0.300000
50%       5.800000    3.000000     4.350000    1.300000
75%       6.400000    3.300000     5.100000    1.800000
max       7