# Pandas 

> Chapter 1. Preparing data

1.1 Reading multiple datafiles

1.2 Reindexing Dataframes

1.3 Arithmetic with Series & DataFrames


In [2]:
# importing the libraries
import pandas as pd

### Read the dataframe

The primary tool for importing data is read_csv. This function accepts the path of a comma-separated values (CSV) file as input and returns a pandas DataFrame directly.

In [3]:
# read the (sales-jan-2015) dataset.

### START CODE HERE : 
dataframe0=pd.read_csv("sales-jan-2015.csv",index_col=0)
### END CODE
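As a minimal sketch of what `index_col=0` changes, the following reads a small in-memory CSV instead of a file. The `Product,Units` layout is assumed from the outputs printed in this notebook; `csv_text`, `default_index`, and `sample` are illustrative names only.

```python
import io
import pandas as pd

# Hypothetical stand-in for the first rows of sales-jan-2015.csv.
csv_text = "Product,Units\nHardware,11\nService,8\nHardware,17\n"

# Default call: pandas generates an integer index 0, 1, 2, ...
default_index = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column (Product) becomes the row index.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.columns.tolist())  # ['Product', 'Units']
print(sample.index.name)               # Product
print(sample.columns.tolist())         # ['Units']
```

With `index_col=0`, rows are labelled by product, which matters later when two frames are combined by label.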

pandas has other convenient tools with similar default calling syntax that import various data formats into data frames:

1. pd.read_excel() #for importing excel files
2. pd.read_html() #for importing html data
3. pd.read_json() #for importing json data
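To illustrate the similar calling syntax, here is a small sketch using `pd.read_json` on an in-memory string. The JSON records are invented for illustration and simply mirror the sales layout; `read_excel` and `read_html` are left out because they need optional extra packages.

```python
import io
import pandas as pd

# Hypothetical JSON records mirroring the Product/Units layout.
json_text = '[{"Product": "Hardware", "Units": 11}, {"Product": "Service", "Units": 8}]'

# Wrapping the string in StringIO lets read_json treat it like a file.
sales_json = pd.read_json(io.StringIO(json_text), orient="records")

print(sales_json)
```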


To read multiple files using pandas, we generally read each file into its own data frame.

In [4]:
# read the datasets separately (sales-jan-2015 and sales-feb-2015)

### START CODE HERE : 
dataframe0=pd.read_csv("sales-jan-2015.csv",index_col=0)
dataframe1=pd.read_csv("sales-feb-2015.csv",index_col=0)
### END CODE

### Using a loop

It is generally more convenient to iterate over a collection of file names than to write one read_csv call per file. To do that, we create a list of filenames containing the two file paths from before. We then initialize an empty list called dataframes and iterate through the list of filenames. In each iteration we call read_csv to read a data frame from a file and append the result to the dataframes list.

In [39]:
# read the datasets ("sales-feb-2015","sales-jan-2015") using a loop

### START CODE HERE : 
filenames=["sales-jan-2015.csv", "sales-feb-2015.csv"]
dataframes=[]
for f in filenames:
    dataframes.append(pd.read_csv(f))
print(dataframes)
### END CODE

[     Product  Units
0   Hardware     11
1    Service      8
2   Hardware     17
3   Hardware     16
4   Hardware     11
5   Software     18
6   Software      1
7    Service      6
8    Service      7
9    Service     19
10  Hardware     17
11   Service     13
12  Hardware     12
13  Software     14
14   Service     16
15  Software     16
16  Hardware      7
17   Service     18
18  Software     13
19   Service      8,      Product  Units
0    Service      4
1   Software     10
2   Software     13
3   Software      3
4    Service     10
5   Software     19
6    Service     19
7   Software      7
8   Hardware     14
9   Software      7
10  Hardware      1
11  Software      4
12   Service      1
13   Service     10
14  Software     13
15   Service     10
16  Hardware     16
17  Hardware      9
18  Software      3
19  Hardware      3]
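The loop pattern can be tried without the real files. The sketch below writes two tiny stand-in CSVs to a temporary directory and reads them back in a loop; the file contents are shortened, hypothetical samples, and storing the frames in a dictionary (`frames`) rather than a list is just one possible design.

```python
import os
import tempfile
import pandas as pd

# Hypothetical two-row stand-ins for the two monthly files.
contents = {
    "sales-jan-2015.csv": "Product,Units\nHardware,11\nService,8\n",
    "sales-feb-2015.csv": "Product,Units\nService,4\nSoftware,10\n",
}

with tempfile.TemporaryDirectory() as tmpdir:
    # Write the sample files so the loop has something to read.
    for name, text in contents.items():
        with open(os.path.join(tmpdir, name), "w") as fh:
            fh.write(text)

    # Read each file into a DataFrame, keyed by its file name.
    frames = {}
    for name in contents:
        frames[name] = pd.read_csv(os.path.join(tmpdir, name))

print(frames["sales-feb-2015.csv"])
```

Keying by file name makes each frame retrievable by month instead of by list position.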


### Using glob

When many file names share a similar pattern, the glob module from the Python Standard Library is very useful.
We use the pattern sales*.csv to match any file name that starts with the prefix sales and ends with the suffix .csv.

In [22]:
# read every dataset whose name matches sales*.csv using glob

### START CODE HERE : 
from glob import glob
# glob returns the list of file names matching the pattern
filenames=glob("sales*.csv")
dataframes=[pd.read_csv(f) for f in filenames]
### END CODE
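A self-contained sketch of the same idea: it creates a few hypothetical files in a temporary directory and shows that only names matching `sales*.csv` are picked up. Sorting the matches is an addition here, since `glob` does not guarantee any particular order.

```python
import os
import tempfile
from glob import glob
import pandas as pd

with tempfile.TemporaryDirectory() as tmpdir:
    # Hypothetical files: two match the pattern, one does not.
    for name in ["sales-jan-2015.csv", "sales-feb-2015.csv", "notes.txt"]:
        with open(os.path.join(tmpdir, name), "w") as fh:
            fh.write("Product,Units\nHardware,1\n")

    # Sort the matches so the order is reproducible across systems.
    matched = sorted(glob(os.path.join(tmpdir, "sales*.csv")))
    matched_names = [os.path.basename(path) for path in matched]
    frames = [pd.read_csv(path) for path in matched]

print(matched_names)  # ['sales-feb-2015.csv', 'sales-jan-2015.csv']
```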

### Reindexing dataframe

~ Sorting dataframe with index and columns. 

~ Reindexing dataframe from a list. 

In [23]:
# Create a dataframe
data = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'], 
        'year': [2012, 2012, 2013, 2014, 2014], 
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)

In [24]:
# sort the table (data) by the county column

### START CODE HERE : 
print("Sorting the table according to the county:")
df.sort_values("county",ascending=True)
### END CODE

Sorting the table according to the county:


       county  year  reports
0     Cochice  2012        4
3    Maricopa  2014        2
1        Pima  2012       24
2  Santa Cruz  2013       31
4        Yuma  2014        3


In [25]:
# sort the table (data) by the year column

### START CODE HERE : 
print("Sorting the table according to the year:")
df.sort_values("year",ascending=True)
### END CODE

Sorting the table according to the year:


       county  year  reports
0     Cochice  2012        4
1        Pima  2012       24
2  Santa Cruz  2013       31
3    Maricopa  2014        2
4        Yuma  2014        3


In [26]:
# sort the table (data) by the reports column

### START CODE HERE : 
print("Sorting the table according to the reports:")
df.sort_values("reports",ascending=True)
### END CODE

Sorting the table according to the reports:


       county  year  reports
3    Maricopa  2014        2
4        Yuma  2014        3
0     Cochice  2012        4
1        Pima  2012       24
2  Santa Cruz  2013       31
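The cells above sort by one column at a time. As a hedged extension, `sort_values` also accepts a list of columns with a matching list of directions, and it returns a new DataFrame rather than changing `df`. The sketch rebuilds the same `df` so that it runs on its own; `by_year_reports` is an illustrative name.

```python
import pandas as pd

data = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)

# Sort by year descending, then by reports ascending within each year.
by_year_reports = df.sort_values(["year", "reports"], ascending=[False, True])
print(by_year_reports["county"].tolist())
# ['Maricopa', 'Yuma', 'Santa Cruz', 'Cochice', 'Pima']

# sort_values returned a new object; df keeps its original row order.
print(df["county"].tolist())
# ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']
```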


In [27]:
# sort the table by its index

### START CODE HERE : 
df.sort_index()
### END CODE

       county  year  reports
0     Cochice  2012        4
1        Pima  2012       24
2  Santa Cruz  2013       31
3    Maricopa  2014        2
4        Yuma  2014        3


In [28]:
# Change the index of the rows: use the county column as the index

### START CODE HERE : 
df.set_index("county")
### END CODE

            year  reports
county
Cochice     2012        4
Pima        2012       24
Santa Cruz  2013       31
Maricopa    2014        2
Yuma        2014        3


In [29]:
# Change the index of the rows: use the year column as the index

### START CODE HERE : 
df.set_index("year")
### END CODE

          county  reports
year
2012     Cochice        4
2012        Pima       24
2013  Santa Cruz       31
2014    Maricopa        2
2014        Yuma        3
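`set_index` and `sort_index` combine naturally: the first chooses which column labels the rows, the second orders the rows by those labels. A small sketch on the same data follows, with `by_county` as an illustrative name; both methods return new DataFrames, so `df` itself is unchanged.

```python
import pandas as pd

data = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)

# Label rows by county, then order them alphabetically by that label.
by_county = df.set_index("county").sort_index()
print(by_county.index.tolist())
# ['Cochice', 'Maricopa', 'Pima', 'Santa Cruz', 'Yuma']

# Rows can now be looked up by county name.
print(by_county.loc["Pima", "reports"])  # 24

# The original frame still has county as an ordinary column.
print(df.columns.tolist())  # ['county', 'year', 'reports']
```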


In [30]:
# reindexing from the dataframe index (here: the same labels in reverse order)
# reindex expects index labels; passing the DataFrame itself would give all-NaN rows

### START CODE HERE : 
df_new=df.reindex(df.index[::-1])
print(df_new)
### END CODE

       county  year  reports
4        Yuma  2014        3
3    Maricopa  2014        2
2  Santa Cruz  2013       31
1        Pima  2012       24
0     Cochice  2012        4
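`reindex` looks rows up by index label, so any requested label that is not present in the index comes back as a row of NaN. The sketch below reindexes from a plain list; the label `7` is deliberately chosen as one that does not exist, and `picked` is an illustrative name.

```python
import pandas as pd

data = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],
        'year': [2012, 2012, 2013, 2014, 2014],
        'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data)

# Labels 4 and 0 exist in df.index; label 7 does not.
picked = df.reindex([4, 0, 7])
print(picked)

# Existing labels keep their data, in the requested order.
print(picked.loc[4, "county"])     # Yuma
# The unknown label produces a row where every value is missing.
print(picked.loc[7].isna().all())  # True
```

Note that introducing NaN typically turns integer columns into floats, since NaN is a floating-point value.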


### Arithmetic with series and dataframes


In [31]:
# load the data sets separately (sales-jan-2015.csv & sales-feb-2015.csv)

### START CODE HERE : 
dataframe0=pd.read_csv("sales-jan-2015.csv",index_col=0)
dataframe1=pd.read_csv("sales-feb-2015.csv",index_col=0)
### END CODE

In [32]:
#display the first five rows of data

### START CODE HERE : 
print("Sales in January, 2015:")
dataframe0.head()
### END CODE

Sales in January, 2015:


          Units
Product
Hardware     11
Service       8
Hardware     17
Hardware     16
Hardware     11


In [33]:
#display the first five rows of data

### START CODE HERE : 
print("Sales in February, 2015:")
dataframe1.head()
### END CODE

Sales in February, 2015:


          Units
Product
Service       4
Software     10
Software     13
Software      3
Service      10


In [34]:
# mean value of the January data set

### START CODE HERE : 
print("Average sales in January,2015:")
dataframe0_mean=dataframe0.mean()
print(dataframe0_mean)
### END CODE

Average sales in January,2015:
Units    12.4
dtype: float64


In [35]:
# mean value of the February data set

### START CODE HERE : 
print("Average sales in February,2015:")
dataframe1_mean=dataframe1.mean()
print(dataframe1_mean)
### END CODE

Average sales in February,2015:
Units    8.8
dtype: float64
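The two means are themselves Series, so they can be combined with ordinary arithmetic. As a sketch, the frames below are rebuilt from the Units values printed earlier in this notebook (without the Product index, which does not affect the mean); `jan`, `feb`, and `change` are illustrative names.

```python
import pandas as pd

# Units values copied from the January and February outputs printed earlier.
jan_units = [11, 8, 17, 16, 11, 18, 1, 6, 7, 19, 17, 13, 12, 14, 16, 16, 7, 18, 13, 8]
feb_units = [4, 10, 13, 3, 10, 19, 19, 7, 14, 7, 1, 4, 1, 10, 13, 10, 16, 9, 3, 3]

jan = pd.DataFrame({"Units": jan_units})
feb = pd.DataFrame({"Units": feb_units})

jan_mean = jan.mean()  # Series with one entry labelled "Units"
feb_mean = feb.mean()

# Subtracting two Series aligns them by label ("Units" here).
change = feb_mean - jan_mean
print(round(change["Units"], 1))  # -3.6
```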


In [36]:
# row-to-row change in January units, expressed as a percentage

### START CODE HERE : 
dataframe0.pct_change()*100
### END CODE

               Units
Product
Hardware         NaN
Service   -27.272727
Hardware  112.500000
Hardware   -5.882353
Hardware  -31.250000
Software   63.636364
Software  -94.444444
Service   500.000000
Service    16.666667
Service   171.428571


In [37]:
# row-to-row change in February units, expressed as a percentage

### START CODE HERE : 
dataframe1.pct_change()*100
### END CODE

               Units
Product
Service          NaN
Software  150.000000
Software   30.000000
Software  -76.923077
Service   233.333333
Software   90.000000
Service     0.000000
Software  -63.157895
Hardware  100.000000
Software  -50.000000


In [38]:
# Adding dataframe0 (January) and dataframe1 (February) by using the add function

### START CODE HERE : 
dataframe0.add(dataframe1)
### END CODE

          Units
Product
Hardware     25
Hardware     12
Hardware     27
Hardware     20
Hardware     14
...         ...
Software     20
Software     20
Software     17
Software     26
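`add` matches rows by index label rather than by position. The sketch below uses two small frames with unique labels (hypothetical per-product totals, not the real files) to show what happens when a label exists in only one frame, and how the `fill_value` parameter changes that.

```python
import pandas as pd

# Hypothetical per-product totals with unique index labels.
jan = pd.DataFrame({"Units": [11, 8]}, index=["Hardware", "Service"])
feb = pd.DataFrame({"Units": [4, 10]}, index=["Service", "Software"])

# Labels present in only one frame produce NaN.
plain = jan.add(feb)
print(plain)

# fill_value=0 treats the missing side as zero instead.
filled = jan.add(feb, fill_value=0)
print(filled["Units"].tolist())  # [11.0, 12.0, 10.0]
```

When labels repeat, as they do in the sales files, each matching pair of rows appears to be combined, which would explain why the result above repeats the product labels so many times.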
