# Hello, Jupyter

Jupyter is a popular environment for working with Python.

At a high level, it consists of **cells** which can contain text (like this one)... 

In [12]:
# ... or code, like this one.
# Go ahead and print('Hello from Jupyter') below.
# By the way, lines that start with a hash sign (#) are comments
# for human readers, not code for the machine to run.


For a more in-depth look at working with Jupyter notebooks, check out the course resources in the conclusion.

# Reading spreadsheet data into Python

You will usually start working with data in Python by importing it from an external source. 

The `read_excel()` function from `pandas` will be helpful for reading worksheet data into Python. 

For more about working with `pandas`, check out the recommended resources in the conclusion. 

## Demo: `superstore.xlsx`

This workbook contains three worksheets. Let's read each of them into Python and perform some data analysis.

In [3]:
# Import pandas
import pandas as pd

# Read in our worksheet with read_excel()
orders = pd.read_excel("superstore.xlsx")

# Sneak peek of the data with head()
orders.head()

,Row ID,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,...,Postal Code,Region,Product ID,Category,Sub-Category,Product Name,Sales,Quantity,Discount,Profit
0,7981,CA-2011-103800,2013-01-03,2013-01-07,Standard Class,DP-13000,Darren Powers,Consumer,United States,Houston,...,77095,Central,OFF-PA-10000174,Office Supplies,Paper,"Message Book, Wirebound, Four 5 1/2"" X 4"" Form...",16.448,2,0.2,5.5512
1,740,CA-2011-112326,2013-01-04,2013-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,...,60540,Central,OFF-LA-10003223,Office Supplies,Labels,Avery 508,11.784,3,0.2,4.2717
2,741,CA-2011-112326,2013-01-04,2013-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,...,60540,Central,OFF-ST-10002743,Office Supplies,Storage,SAFCO Boltless Steel Shelving,272.736,3,0.2,-64.7748
3,742,CA-2011-112326,2013-01-04,2013-01-08,Standard Class,PO-19195,Phillina Ober,Home Office,United States,Naperville,...,60540,Central,OFF-BI-10004094,Office Supplies,Binders,GBC Standard Plastic Binding Systems Combs,3.54,2,0.8,-5.487
4,1760,CA-2011-141817,2013-01-05,2013-01-12,Standard Class,MB-18085,Mick Brown,Consumer,United States,Philadelphia,...,19143,East,OFF-AR-10003478,Office Supplies,Art,Avery Hi-Liter EverBold Pen Style Fluorescent ...,19.536,3,0.2,4.884


By default, `pandas.read_excel()` reads the first worksheet in the workbook. To read a different one, specify the second argument, `sheet_name`.
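As a minimal sketch of how `sheet_name` behaves, the example below builds a small two-sheet workbook in a temporary folder and reads it back. The file name `demo.xlsx`, the sheet contents, and the variable names are made up for illustration, and the sketch assumes an Excel engine such as `openpyxl` is installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Build a tiny two-sheet workbook so the example is self-contained.
# The file name and values are hypothetical, not taken from superstore.xlsx.
demo_path = Path(tempfile.mkdtemp()) / "demo.xlsx"
with pd.ExcelWriter(demo_path) as writer:
    pd.DataFrame({"Order ID": ["A-1", "A-2"], "Sales": [10.0, 20.0]}).to_excel(
        writer, sheet_name="orders", index=False
    )
    pd.DataFrame({"Order ID": ["A-2"], "Returned": ["Yes"]}).to_excel(
        writer, sheet_name="returns", index=False
    )

# With no sheet_name, pandas reads the first worksheet (position 0).
first_sheet = pd.read_excel(demo_path)

# With sheet_name=None, pandas reads every worksheet into a dict
# keyed by worksheet name.
all_sheets = pd.read_excel(demo_path, sheet_name=None)

print(list(all_sheets))
print(first_sheet.equals(all_sheets["orders"]))
```

Reading everything with `sheet_name=None` can be convenient when a workbook has many worksheets, at the cost of loading sheets you may not need.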

In [5]:
# Read in all three worksheets this time
orders = pd.read_excel("superstore.xlsx", sheet_name='orders')
returns = pd.read_excel("superstore.xlsx", sheet_name='returns')
people = pd.read_excel("superstore.xlsx", sheet_name='people')

# Preview all three `pandas` DataFrames
print(orders.head())
print(returns.head())
print(people.head())

Row ID        Order ID Order Date  Ship Date       Ship Mode Customer ID  \
0    7981  CA-2011-103800 2013-01-03 2013-01-07  Standard Class    DP-13000   
1     740  CA-2011-112326 2013-01-04 2013-01-08  Standard Class    PO-19195   
2     741  CA-2011-112326 2013-01-04 2013-01-08  Standard Class    PO-19195   
3     742  CA-2011-112326 2013-01-04 2013-01-08  Standard Class    PO-19195   
4    1760  CA-2011-141817 2013-01-05 2013-01-12  Standard Class    MB-18085   

   Customer Name      Segment        Country          City  ... Postal Code  \
0  Darren Powers     Consumer  United States       Houston  ...       77095   
1  Phillina Ober  Home Office  United States    Naperville  ...       60540   
2  Phillina Ober  Home Office  United States    Naperville  ...       60540   
3  Phillina Ober  Home Office  United States    Naperville  ...       60540   
4     Mick Brown     Consumer  United States  Philadelphia  ...       19143   

    Region       Product ID         Category Sub-Cate

In [7]:
# Renaming columns to make them easier to
# operate on in `pandas`
orders.columns = orders.columns.str.lower()
people.columns = people.columns.str.lower()

print(orders.columns)
print(people.columns)

Index(['row id', 'order id', 'order date', 'ship date', 'ship mode',
       'customer id', 'customer name', 'segment', 'country', 'city', 'state',
       'postal code', 'region', 'product id', 'category', 'sub-category',
       'product name', 'sales', 'quantity', 'discount', 'profit'],
      dtype='object')
Index(['person', 'region'], dtype='object')


We will now "look up" the salesperson names into the orders data, find the total sales for each salesperson, and then write that to Excel.

Don't worry too much about the code to manipulate the data in `pandas`.

Instead, focus on the code to read and write data in and out of Excel, the focus of this course.

I will share resources at the conclusion of this course if you would like to learn more about analyzing and manipulating datasets in Python.

In [8]:
# "Look up" the salesperson into the orders data
report = orders.merge(people, on='region', how='left')

# Find total sales and profit by rep.
# Selecting several columns requires a list (double brackets).
report_agg = report.groupby(['person'])[['sales','profit']].sum()

# Preview our report
report_agg.head()

person,sales,profit
Anna Andreadi,725457.8245,108418.4489
Cassandra Brandow,391721.905,46749.4303
Chuck Magee,678781.24,91522.78
Kelly Williams,501239.8908,39706.3625
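To see what the "look up" and totaling steps do, here is a small sketch on toy data. The names and numbers are invented for illustration and are not from `superstore.xlsx`.

```python
import pandas as pd

# Toy stand-ins for the orders and people tables (values are made up).
orders = pd.DataFrame({
    "region": ["East", "West", "East", "Central"],
    "sales": [100.0, 50.0, 25.0, 10.0],
    "profit": [20.0, 5.0, -5.0, 1.0],
})
people = pd.DataFrame({
    "person": ["Ann", "Bob"],
    "region": ["East", "West"],
})

# A left merge keeps every order row and fills in the matching person.
# Regions with no match (here "Central") get a missing value.
report = orders.merge(people, on="region", how="left")

# Total sales and profit for each person. Rows with a missing person
# are dropped by groupby by default.
report_agg = report.groupby("person")[["sales", "profit"]].sum()
print(report_agg)
```

A left merge behaves much like a lookup formula in Excel: the orders table keeps all of its rows, and unmatched rows simply have no salesperson.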


We can now write the results of `report_agg` to Excel using the `to_excel()` method. We will specify what to call this file. 

In [44]:
# Let's write this to Excel.

report_agg.to_excel("sales-report.xlsx")

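A relative file name such as `sales-report.xlsx` is written to the current working directory, which for a notebook is usually the folder containing the notebook. To save the workbook elsewhere, pass a full file path to `to_excel()`; for more, check out file paths and directory paths in Python.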
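One possible way to control where the workbook lands is to build the path with `pathlib` from the standard library. This is only a sketch: the `reports` folder and the toy DataFrame are hypothetical, a temporary directory stands in for a real project folder, and an Excel engine such as `openpyxl` is assumed to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Relative file names are resolved against the current working directory.
print(Path.cwd())

# Hypothetical output folder, created inside a temporary directory.
out_dir = Path(tempfile.mkdtemp()) / "reports"
out_dir.mkdir(parents=True, exist_ok=True)

# Join the folder and the file name into one full path.
out_path = out_dir / "sales-report.xlsx"

# Toy report (made-up values) written to the chosen location.
report = pd.DataFrame({"person": ["Ann", "Bob"], "sales": [125.0, 50.0]})
report.to_excel(out_path, index=False)

print(out_path.exists())
```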

# DRILL: `baseball.xlsx`

Now it's your turn to read worksheets into `pandas` DataFrames, operate on them, and export the results back to Excel.

I have completed the code to conduct the data manipulation. You finish the code to read and write the data. 

In [None]:

# Import pandas. We will need it for the data manipulation
import pandas as pd


#  Read the `teams`, `salaries` and `people` worksheets 
#  into DataFrames of the same names.
teams = pd.read_excel(___, ___='teams')
salaries = pd.read_excel(___, ___=___)
people = pd.read_excel(___, ___=___)

In [57]:
# "Look up" first names and 
# last names from the people table into the
# salaries table. 
salaries_report = salaries.merge(people[['playerID','nameFirst','nameLast']],on='playerID',how='left')

# Find total salaries by player.
# This line is completed for you to run. 
salaries_agg = salaries_report.groupby(['playerID','nameFirst','nameLast'])['salary'].sum()

# Preview our report
salaries_agg.head()

In [71]:
# Write this result (a pandas Series) to an Excel file
# called `salaries-report.xlsx`.
# We will also sort the values before doing so.
___.sort_values(ascending=False).___(___)

Congrats on moving Excel data in and out of Python using `pandas`! Now, let's look at another, more versatile way to produce Excel reports from Python.