# Pandas

### Working with Datasets using Pandas
- Pandas is an open-source library for working with tabular data.
- It is widely used for data analysis in Python.
- We will use it to read and write data between in-memory data structures and files: CSV files, text files, SQL databases, Excel sheets, etc.
- It also supports reshaping, slicing, indexing, merging, and joining datasets.

### Installation 
`pip install pandas`  

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [77]:
# Create a DataFrame (a table with labeled columns).
# Start from a dictionary of data; a dictionary of equal-length arrays can be converted to a DataFrame.

user_data = {
    "Marks1" : np.random.randint(1,100,10),  # 10 random integers from 1 to 99 (the upper bound 100 is excluded)
    "Marks2" : np.random.randint(40,100,10),  # here marks are from 40 to 99
    "Marks3" : np.random.randint(50,100,10)  # here marks are from 50 to 99
}
print(user_data)

{'Marks1': array([31, 64, 69, 77, 59, 35, 78, 61, 40, 31]), 'Marks2': array([90, 79, 87, 60, 67, 71, 47, 91, 53, 86]), 'Marks3': array([99, 74, 56, 59, 96, 98, 93, 87, 96, 53])}


In [78]:
# DataFrame is a class; passing arguments here calls its __init__() method.
df = pd.DataFrame(user_data)  # the dictionary is passed as an argument to the DataFrame constructor
print(df)
# 'df' is now an instance of DataFrame, so it can use the methods of the DataFrame class.

   Marks1  Marks2  Marks3
0      31      90      99
1      64      79      74
2      69      87      56
3      77      60      59
4      59      67      96
5      35      71      98
6      78      47      93
7      61      91      87
8      40      53      96
9      31      86      53


In [79]:
df.head()  # first 5 rows by default

   Marks1  Marks2  Marks3
0      31      90      99
1      64      79      74
2      69      87      56
3      77      60      59
4      59      67      96


In [80]:
df.tail(n=3)  # last 3 rows

   Marks1  Marks2  Marks3
7      61      91      87
8      40      53      96
9      31      86      53


In [83]:
df   # this is our dataframe

   Marks1  Marks2  Marks3
0      31      90      99
1      64      79      74
2      69      87      56
3      77      60      59
4      59      67      96
5      35      71      98
6      78      47      93
7      61      91      87
8      40      53      96
9      31      86      53


In [85]:
print(df.columns)  # returns an Index object holding the column labels

Index(['Marks1', 'Marks2', 'Marks3'], dtype='object')


#### df.index attribute:
- It stores the row labels (index values) of the DataFrame.

In [86]:
df.index  # in this DataFrame the index is an automatically generated RangeIndex, not user-defined

RangeIndex(start=0, stop=10, step=1)
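For contrast, an index can also be supplied by the user. The following is a minimal sketch with a made-up set of labels (`s1`, `s2`, `s3` are hypothetical student IDs) showing a default `RangeIndex` next to a user-defined one:

```python
import pandas as pd

# Default index: pandas generates a RangeIndex automatically.
default = pd.DataFrame({"Marks1": [31, 64, 69]})

# User-defined index: hypothetical student labels passed via 'index'.
marks = pd.DataFrame(
    {"Marks1": [31, 64, 69], "Marks2": [90, 79, 87]},
    index=["s1", "s2", "s3"],
)

print(default.index)
print(marks.index)

# Rows can then be looked up by label with .loc.
print(marks.loc["s2", "Marks2"])
```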

In [90]:
a = df.values    # df.values returns the data as a 2D NumPy array here (df.to_numpy() is the recommended equivalent)
print(a)   
print(type(a))
print(a.shape)  

[[31 90 99]
 [64 79 74]
 [69 87 56]
 [77 60 59]
 [59 67 96]
 [35 71 98]
 [78 47 93]
 [61 91 87]
 [40 53 96]
 [31 86 53]]
<class 'numpy.ndarray'>
(10, 3)


In [91]:
# Now we can access elements as in any 2D array.
# A DataFrame does not support this positional a[6][2] style directly; use df.iloc[6, 2] (or .iat, .loc, .at) instead.
a[6][2]

93
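A DataFrame offers several accessors for a single element. The sketch below uses a small three-row frame (`df_demo`, with values copied from the first rows above) and reads the same cell through the NumPy array and through `iloc`, `iat`, `loc`, and `at`:

```python
import pandas as pd

# First three rows of the marks data, rebuilt so the snippet runs on its own.
df_demo = pd.DataFrame(
    {"Marks1": [31, 64, 69], "Marks2": [90, 79, 87], "Marks3": [99, 74, 56]}
)
arr = df_demo.to_numpy()

by_array = arr[2][1]                # plain NumPy indexing (same as arr[2, 1])
by_iloc = df_demo.iloc[2, 1]        # position-based, rows and columns
by_iat = df_demo.iat[2, 1]          # position-based, single scalar only
by_loc = df_demo.loc[2, "Marks2"]   # label-based
by_at = df_demo.at[2, "Marks2"]     # label-based, single scalar only

print(by_array, by_iloc, by_iat, by_loc, by_at)
```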

- We can also convert this NumPy array back to a DataFrame.

In [97]:
df = pd.DataFrame(a, dtype='int32', columns= ['Physics', 'Maths', 'Chem'])

In [98]:
df.to_csv('06.2test_marks.csv')

In [33]:
my_data = pd.read_csv('06.2test_marks.csv')
my_data.head()

   Unnamed: 0  Marks1  Marks2  Marks3
0           0      42      68      60
1           1       1      95      93
2           2       1      80      78
3           3      27      52      71
4           4      17      90      80


In [34]:
my_data.drop(columns=['Unnamed: 0'])    # returns a new DataFrame without the 'Unnamed: 0' column; my_data itself is unchanged

   Marks1  Marks2  Marks3
0      42      68      60
1       1      95      93
2       1      80      78
3      27      52      71
4      17      90      80
5      76      98      75
6      95      48      56
7      29      86      82
8      98      78      54
9      38      49      78
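Because `drop` returns a new DataFrame, the result has to be assigned to a name if it is needed later. A minimal sketch, using a small made-up frame (`frame` and `trimmed` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Hypothetical frame that mimics a CSV read back with its saved index column.
frame = pd.DataFrame(
    {"Unnamed: 0": [0, 1, 2], "Marks1": [42, 1, 1], "Marks2": [68, 95, 80]}
)

# drop() does not modify 'frame'; it returns a new DataFrame.
trimmed = frame.drop(columns=["Unnamed: 0"])

print(list(frame.columns))    # the original still has all three columns
print(list(trimmed.columns))  # the result has only the marks columns
```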


In [96]:
df.to_csv('06.2test_marks.csv', index=False)  # the default is index=True
# Passing index=False omits the index column from the CSV file.
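The effect of `index` can be seen without touching the disk. The sketch below is only an illustration: it uses a small made-up frame and an in-memory `io.StringIO` buffer in place of the CSV file, and also shows `index_col=0` as another way to avoid the extra `Unnamed: 0` column when reading.

```python
import io

import pandas as pd

# Hypothetical two-row frame, used only for illustration.
df_demo = pd.DataFrame({"Marks1": [42, 1], "Marks2": [68, 95]})

# With no path, to_csv() returns the CSV text as a string.
with_index = df_demo.to_csv()                # index=True by default
without_index = df_demo.to_csv(index=False)  # no index column written

print(with_index)
print(without_index)

# Reading the text that includes the index yields an extra 'Unnamed: 0' column...
extra_col = pd.read_csv(io.StringIO(with_index))
# ...unless the first column is declared to be the index.
as_index = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(extra_col.columns))
print(list(as_index.columns))
```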

## Pandas Basics Part-2

- Let us recreate our data with fewer rows to keep the output small.

In [103]:
user_data = {
    "Marks1" : np.random.randint(1,100,5),
    "Marks2" : np.random.randint(40,100,5),
    "Marks3" : np.random.randint(50,100,5)
}
df = pd.DataFrame(user_data)
df.head()

   Marks1  Marks2  Marks3
0      19      44      57
1      97      41      57
2      80      71      85
3      60      90      64
4       4      62      86


In [105]:
df.describe()

          Marks1     Marks2     Marks3
count   5.000000   5.000000   5.000000
mean   52.000000  61.600000  69.800000
std    39.579035  20.181675  14.618481
min     4.000000  41.000000  57.000000
25%    19.000000  44.000000  57.000000
50%    60.000000  62.000000  64.000000
75%    80.000000  71.000000  85.000000
max    97.000000  90.000000  86.000000
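To see where these numbers come from, the sketch below recomputes a few of them by hand for the `Marks1` values shown above. The variable names are illustrative. Note that `std` here is the sample standard deviation, which divides by `n - 1`.

```python
import pandas as pd

# Marks1 values copied from the five-row DataFrame above.
marks1 = pd.Series([19, 97, 80, 60, 4], name="Marks1")
summary = marks1.describe()

n = len(marks1)
mean = marks1.sum() / n
# Sample standard deviation: squared deviations divided by (n - 1).
std = (((marks1 - mean) ** 2).sum() / (n - 1)) ** 0.5
# The 50% row is the median.
median = marks1.quantile(0.5)

print(mean, summary["mean"])
print(std, summary["std"])
print(median, summary["50%"])
```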


In [107]:
df.iloc[0]  # returns the first row as a Series

Marks1    19
Marks2    44
Marks3    57
Name: 0, dtype: int64

In [110]:
# To access a single element, pass the row and column positions:
df.iloc[0,2]  # element at the 1st row and 3rd column

57

In [111]:
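df.iloc[0][2]  # same value via chained indexing; df.iloc[0, 2] is preferred, and recent pandas versions warn about positional integer keys on a labeled Series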

57

In [113]:
df  #this is our original data

   Marks1  Marks2  Marks3
0      19      44      57
1      97      41      57
2      80      71      85
3      60      90      64
4       4      62      86


In [114]:
idx = df.columns.get_loc('Marks1')  # integer position of the 'Marks1' column
print(idx)
df.iloc[0, idx]  # use that position to look up the element in row 0

0


19

#### Slicing on df
- We can also access multiple rows, multiple columns, or both from a DataFrame.

In [116]:
df.iloc[0,:2]  # row 0, columns 0 and 1 (the end of a slice is excluded)

Marks1    19
Marks2    44
Name: 0, dtype: int64

In [117]:
idx = [df.columns.get_loc('Marks1'), df.columns.get_loc('Marks1')]  # list of column positions; 'Marks1' is listed twice, so it appears twice in the output
df.iloc[0, idx]  # pass the list of positions as the column indexer

Marks1    19
Marks1    19
Name: 0, dtype: int64

In [118]:
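# Show the first 3 rows of the columns listed in idx.
df.iloc[:3, idx]   # here idx is [0, 0], so 'Marks1' is repeated; 'df.iloc[:3, :2]' would give 'Marks1' and 'Marks2', and 'df.iloc[:3, [1,2]]' would give 'Marks2' and 'Marks3'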

   Marks1  Marks1
0      19      19
1      97      97
2      80      80
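`iloc` slices by position and excludes the end, while `loc` slices by label and includes it. A short sketch comparing the two on the same five-row data (rebuilt here as `df_demo` so the snippet runs on its own):

```python
import pandas as pd

# Same values as the five-row DataFrame used in this part.
df_demo = pd.DataFrame(
    {
        "Marks1": [19, 97, 80, 60, 4],
        "Marks2": [44, 41, 71, 90, 62],
        "Marks3": [57, 57, 85, 64, 86],
    }
)

# Position-based: rows 0..2 and columns 0..1 (end positions excluded).
by_position = df_demo.iloc[:3, :2]

# Label-based: the end labels 2 and 'Marks2' are included.
by_label = df_demo.loc[:2, "Marks1":"Marks2"]

# An explicit list of column positions selects the same two columns.
by_list = df_demo.iloc[:3, [0, 1]]

print(by_position)
print(by_position.equals(by_label), by_position.equals(by_list))
```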


In [119]:
df.sort_values(by='Marks1')

   Marks1  Marks2  Marks3
4       4      62      86
0      19      44      57
3      60      90      64
2      80      71      85
1      97      41      57


In [120]:
# Sort by Marks1 in descending order.
df.sort_values(by='Marks1', ascending=False)  # the default value of ascending is True; see the documentation

   Marks1  Marks2  Marks3
1      97      41      57
2      80      71      85
3      60      90      64
0      19      44      57
4       4      62      86


In [121]:
df.sort_values(by=['Marks3', 'Marks1'], ascending=False)  # sort by Marks3, breaking ties with Marks1, both descending

   Marks1  Marks2  Marks3
4       4      62      86
2      80      71      85
3      60      90      64
1      97      41      57
0      19      44      57
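`ascending` also accepts one flag per sort column. The sketch below, on the same data rebuilt as `df_demo`, sorts `Marks3` in descending order but breaks ties on `Marks1` in ascending order. Like the calls above, `sort_values` returns a new DataFrame and leaves the original order untouched.

```python
import pandas as pd

# Same values as the five-row DataFrame used in this part.
df_demo = pd.DataFrame(
    {
        "Marks1": [19, 97, 80, 60, 4],
        "Marks2": [44, 41, 71, 90, 62],
        "Marks3": [57, 57, 85, 64, 86],
    }
)

# One ascending flag per column: Marks3 descending, Marks1 ascending.
mixed = df_demo.sort_values(by=["Marks3", "Marks1"], ascending=[False, True])

print(mixed)
print(list(df_demo.index))  # the original DataFrame keeps its row order
```

Compared with the output above, only the two rows tied at `Marks3 == 57` change places.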


In [125]:
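chem = df.get('Marks3')  # like df['Marks3'], but returns None instead of raising if the column is missing
type(chem)  # a single column is a Series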

pandas.core.series.Series

In [126]:
chem = list(chem)  # convert the Series to a plain Python list
type(chem)

list
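The sketch below gathers the common ways to pull a column out and convert it, again on a rebuilt copy of the data (`df_demo`). `Marks4` is a deliberately nonexistent column name used to show the difference between `get` and bracket access.

```python
import pandas as pd

# Same values as the five-row DataFrame used in this part.
df_demo = pd.DataFrame(
    {
        "Marks1": [19, 97, 80, 60, 4],
        "Marks2": [44, 41, 71, 90, 62],
        "Marks3": [57, 57, 85, 64, 86],
    }
)

chem = df_demo["Marks3"]          # bracket access raises KeyError for a missing column
missing = df_demo.get("Marks4")   # get() returns None (or a given default) instead

as_list = chem.tolist()           # plain Python list
as_array = chem.to_numpy()        # NumPy array

print(type(chem).__name__, type(as_list).__name__, type(as_array).__name__)
print(missing)
```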