# Excel File Manipulation with pandas

pandas offers the following tools for working with Excel files:
- the **read_excel** function and the **ExcelFile** class for reading,
- the **to_excel** method and the **ExcelWriter** class for writing Excel files.
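
As a quick orientation, the following sketch shows a minimal round trip with **to_excel** and **read_excel**. The sample data and the sheet name are made up for illustration, and an in-memory buffer stands in for a file on disk; it assumes an Excel engine such as OpenPyXL is installed.

```python
import io

import pandas as pd

# Hypothetical sample data, not taken from the sales files.
df = pd.DataFrame({"Store": ["New York", "Boston"], "Employees": [10, 5]})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.BytesIO()
df.to_excel(buffer, sheet_name="Stores", index=False)

# Rewind the buffer and read the same sheet back into a new DataFrame.
buffer.seek(0)
round_trip = pd.read_excel(buffer, sheet_name="Stores")
print(round_trip)
```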

# Case Study: Excel Reporting

In the companion repository, the sales_data directory contains Excel files with fictitious sales transactions for a telecommunications provider that sells different plans (Bronze, Silver, Gold) in a few stores throughout the United States. For every month, there are two files: one in the new subfolder for new contracts and one in the existing subfolder for existing customers. Because the reports come from different systems, their formats differ: the new-customer files are delivered as xlsx files, while the existing-customer files arrive in the older xls format. Each file has up to 10,000 transactions, and our goal is to produce an Excel report that shows the total sales per store and month.

In [1]:
import pandas as pd

In [2]:
url = ("https://raw.githubusercontent.com/fzumstein/"
       "python-for-excel/1st-edition/xl/stores.xlsx")

In [3]:
df = pd.read_excel(url)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7 entries, 0 to 6
Data columns (total 6 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  0 non-null      float64
 1   Unnamed: 1  7 non-null      object 
 2   Unnamed: 2  7 non-null      object 
 3   Unnamed: 3  7 non-null      object 
 4   Unnamed: 4  6 non-null      object 
 5   Unnamed: 5  6 non-null      object 
dtypes: float64(1), object(5)
memory usage: 464.0+ bytes


In [5]:
!python sales_report_pandas.py

Reading February.xlsx
Reading August.xlsx
Reading April.xlsx
Reading June.xlsx
Reading October.xlsx
Reading September.xlsx
Reading July.xlsx
Reading January.xlsx
Reading May.xlsx
Reading November.xlsx
Reading December.xlsx
Reading March.xlsx
Reading October.xls
Reading September.xls
Reading April.xls
Reading June.xls
Reading January.xls
Reading July.xls
Reading February.xls
Reading December.xls
Reading August.xls
Reading November.xls
Reading March.xls
Reading May.xls
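
The aggregation step behind such a report can be sketched as follows. This is not the content of sales_report_pandas.py; it is a minimal illustration under the assumption that each file has been read into a DataFrame with the hypothetical columns `transaction_date`, `store`, and `amount`. The helper name `monthly_sales_by_store` is also made up.

```python
import pandas as pd


def monthly_sales_by_store(frames):
    """Combine transaction DataFrames and total the amount per month and store."""
    df = pd.concat(frames, ignore_index=True)
    # Label each transaction with its calendar month, e.g. "2019-01".
    df["month"] = df["transaction_date"].dt.strftime("%Y-%m")
    return df.pivot_table(index="month", columns="store",
                          values="amount", aggfunc="sum")


# Hypothetical stand-ins for DataFrames read from two of the monthly files.
new = pd.DataFrame({
    "transaction_date": pd.to_datetime(["2019-01-05", "2019-02-10"]),
    "store": ["Boston", "Boston"],
    "amount": [100.0, 50.0],
})
existing = pd.DataFrame({
    "transaction_date": pd.to_datetime(["2019-01-20", "2019-01-25"]),
    "store": ["Chicago", "Boston"],
    "amount": [30.0, 20.0],
})

report = monthly_sales_by_store([new, existing])
print(report)
```

In a real run, the list of frames would come from calling **read_excel** on every file in the new and existing subfolders, and the resulting summary could be written out with **to_excel**.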


# Reading and Writing Excel Files with pandas

## The read_excel Function and ExcelFile Class 

In [6]:
df = pd.read_excel(url,
                  sheet_name="2019", skiprows=1, usecols="B:F")
df

,Store,Employees,Manager,Since,Flagship
0,New York,10,Sarah,2018-07-20,False
1,San Francisco,12,Neriah,2019-11-02,MISSING
2,Chicago,4,Katelin,2020-01-31,
3,Boston,5,Georgiana,2017-04-01,True
4,Washington DC,3,Evan,NaT,False
5,Las Vegas,11,Paul,2020-01-06,False


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Store      6 non-null      object        
 1   Employees  6 non-null      int64         
 2   Manager    6 non-null      object        
 3   Since      5 non-null      datetime64[ns]
 4   Flagship   5 non-null      object        
dtypes: datetime64[ns](1), int64(1), object(3)
memory usage: 368.0+ bytes


The *Flagship* column should have the data type bool rather than object: the MISSING placeholder and the empty cell prevent pandas from inferring bool. We'll write a converter function to fix the data type.

In [8]:
def fix_missing(x):
    # Treat empty cells and the "MISSING" placeholder as False; keep other values.
    return False if x in ["", "MISSING"] else x

In [9]:
df = pd.read_excel(url,
                  sheet_name="2019", skiprows=1, usecols="B:F",
                  converters={"Flagship": fix_missing})
df

,Store,Employees,Manager,Since,Flagship
0,New York,10,Sarah,2018-07-20,False
1,San Francisco,12,Neriah,2019-11-02,False
2,Chicago,4,Katelin,2020-01-31,False
3,Boston,5,Georgiana,2017-04-01,True
4,Washington DC,3,Evan,NaT,False
5,Las Vegas,11,Paul,2020-01-06,False


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6 entries, 0 to 5
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype         
---  ------     --------------  -----         
 0   Store      6 non-null      object        
 1   Employees  6 non-null      int64         
 2   Manager    6 non-null      object        
 3   Since      5 non-null      datetime64[ns]
 4   Flagship   6 non-null      bool          
dtypes: bool(1), datetime64[ns](1), int64(1), object(2)
memory usage: 326.0+ bytes


The **read_excel** function also accepts a list of sheet names. In that case, it returns a dictionary with the sheet names as keys and the DataFrames as values. To read in all sheets, provide **sheet_name=None**.

In [11]:
sheets = pd.read_excel(url,
                       sheet_name=["2019", "2020"], skiprows=1,
                       usecols=["Store", "Employees"])
sheets["2019"].head(2)

,Store,Employees
0,New York,10
1,San Francisco,12
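
To illustrate **sheet_name=None**, here is a small sketch that builds a two-sheet workbook in memory and reads every sheet back at once. The workbook contents are hypothetical, and the buffer stands in for a real file.

```python
import io

import pandas as pd

# Build a small two-sheet workbook in memory (hypothetical data).
buffer = io.BytesIO()
with pd.ExcelWriter(buffer) as writer:
    pd.DataFrame({"Store": ["New York"], "Employees": [10]}).to_excel(
        writer, sheet_name="2019", index=False)
    pd.DataFrame({"Store": ["New York"], "Employees": [12]}).to_excel(
        writer, sheet_name="2020", index=False)
buffer.seek(0)

# sheet_name=None returns a dict that maps sheet names to DataFrames.
all_sheets = pd.read_excel(buffer, sheet_name=None)
for name, frame in all_sheets.items():
    print(name, frame.shape)
```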


If the source file doesn't have column headers, set **header=None** and provide them via **names**. Note that **sheet_name** also accepts sheet indices:

In [12]:
df = pd.read_excel(url,
                    sheet_name=0, skiprows=2, skipfooter=3,
                    usecols="B:C,F", header=None,
                    names=["Branch", "Employee_Count", "Is_Flagship"])
df

,Branch,Employee_Count,Is_Flagship
0,New York,10,False
1,San Francisco,12,MISSING
2,Chicago,4,


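To control which values are read as NaN, combine **na_values** and **keep_default_na**: **na_values** adds values that pandas should interpret as NaN, while **keep_default_na=False** turns off pandas' built-in list of NaN markers (for example, the strings NA and N/A), so that only the values given in **na_values** are treated as NaN.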

In [13]:
df = pd.read_excel(url,
                    sheet_name="2019", skiprows=1, skipfooter=2,
                    usecols="B,C,F", header=None,
                    na_values="MISSING", keep_default_na=False)
df

,1,2,5
0,Store,Employees,Flagship
1,New York,10,False
2,San Francisco,12,
3,Chicago,4,
4,Boston,5,True
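
The interplay of the two parameters is easier to see side by side. The following sketch reads the same in-memory workbook three times; the `Status` column and its values are invented for the comparison and are not part of stores.xlsx.

```python
import io

import pandas as pd

# Hypothetical column with a custom placeholder ("MISSING") and a string
# that is on pandas' default list of NaN markers ("N/A").
source = pd.DataFrame({"Store": ["A", "B", "C"],
                       "Status": ["MISSING", "N/A", "open"]})
buffer = io.BytesIO()
source.to_excel(buffer, index=False)
data = buffer.getvalue()

# Defaults: only the built-in markers such as "N/A" become NaN.
default = pd.read_excel(io.BytesIO(data))

# na_values extends the built-in markers with "MISSING".
custom = pd.read_excel(io.BytesIO(data), na_values="MISSING")

# keep_default_na=False: only "MISSING" becomes NaN, "N/A" stays a string.
strict = pd.read_excel(io.BytesIO(data), na_values="MISSING",
                       keep_default_na=False)

for label, frame in [("default", default), ("custom", custom), ("strict", strict)]:
    print(label, frame["Status"].isna().tolist())
```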


The **ExcelFile** class mostly makes a difference if you want to read in multiple sheets from a file in the legacy *xls* format. It prevents pandas from reading in the whole file multiple times.

In [14]:
with pd.ExcelFile("https://raw.githubusercontent.com/fzumstein/"
                  "python-for-excel/1st-edition/xl/stores.xls") as f:
    df1 = pd.read_excel(f, sheet_name="2019", skiprows=1, usecols="B:F", nrows=2)
    df2 = pd.read_excel(f, sheet_name="2020", skiprows=1, usecols="B:F", nrows=2)
df1

,Store,Employees,Manager,Since,Flagship
0,New York,10,Sarah,2018-07-20,False
1,San Francisco,12,Neriah,2019-11-02,MISSING


**ExcelFile** also gives you access to the names of all sheets:

In [15]:
stores = pd.ExcelFile(url)
stores.sheet_names

['2019', '2020', '2019-2020']
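
Combining **sheet_names** with the **parse** method of **ExcelFile** is one way to loop over all sheets while opening the workbook only once. The sketch below uses a hypothetical in-memory workbook rather than stores.xlsx.

```python
import io

import pandas as pd

# Build a hypothetical workbook with one sheet per year.
buffer = io.BytesIO()
with pd.ExcelWriter(buffer) as writer:
    for year, employees in [("2019", 10), ("2020", 12)]:
        pd.DataFrame({"Store": ["New York"], "Employees": [employees]}).to_excel(
            writer, sheet_name=year, index=False)
buffer.seek(0)

# Open the workbook once, then parse each sheet by name.
with pd.ExcelFile(buffer) as f:
    frames = {name: f.parse(name) for name in f.sheet_names}

print(list(frames))
```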

## The to_excel Method and ExcelWriter Class 

In [16]:
import numpy as np
import datetime as dt

In [17]:
data = [[dt.datetime(2020, 1, 1, 10, 13), 2.222, 1, True],
        [dt.datetime(2020, 1, 2), np.nan, 2, False],
        [dt.datetime(2020, 1, 2), np.inf, 3, True]]
df = pd.DataFrame(data=data,
                  columns=["Dates", "Floats", "Integers", "Booleans"])
df.index.name = "index"
df

index,Dates,Floats,Integers,Booleans
0,2020-01-01 10:13:00,2.222,1,True
1,2020-01-02 00:00:00,,2,False
2,2020-01-02 00:00:00,inf,3,True


In [18]:
df.to_excel("written_with_pandas.xlsx", sheet_name="Output", startrow=1, startcol=1, 
            index=True, header=True, na_rep="<NA>", inf_rep="<INF>")

To write multiple DataFrames to the same or different sheets, you will need to use the **ExcelWriter** class.

In [19]:
with pd.ExcelWriter("written_with_pandas2.xlsx") as writer: 
    df.to_excel(writer, sheet_name="Sheet1", startrow=1, startcol=1) 
    df.to_excel(writer, sheet_name="Sheet1", startrow=10, startcol=1) 
    df.to_excel(writer, sheet_name="Sheet2")
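
Because **startrow** and **startcol** are zero-based offsets, a DataFrame written with `startrow=1, startcol=1` has its top-left cell in B2. The following sketch writes a small, made-up DataFrame at that position and reads it back with matching **skiprows**, **usecols**, and **index_col** arguments; the buffer again stands in for a file on disk.

```python
import io

import pandas as pd

# Hypothetical DataFrame with a named index.
df = pd.DataFrame({"Integers": [1, 2, 3], "Booleans": [True, False, True]})
df.index.name = "index"

buffer = io.BytesIO()
with pd.ExcelWriter(buffer) as writer:
    # The written block starts in cell B2: one row down, one column to the right.
    df.to_excel(writer, sheet_name="Sheet1", startrow=1, startcol=1)
buffer.seek(0)

# Skip the empty first row, read columns B:D, and use column B as the index.
restored = pd.read_excel(buffer, sheet_name="Sheet1", skiprows=1,
                         usecols="B:D", index_col=0)
print(restored)
```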