# Importing & Exporting Data with Pandas

Pandas can read and write data in a variety of file formats, including CSV, JSON, Excel, HTML, and Pickle. The full list of formats that pandas supports is in the [pandas IO tools user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).

Pandas provides reader and writer functions for each format. Reader functions such as `pd.read_csv()` load data into a data frame, while writer methods such as `DataFrame.to_csv()` save a data frame to a file.
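
To make the reader/writer pairing concrete, here is a minimal sketch of a round trip through CSV and JSON text. The toy data frame is made up for illustration, and `io.StringIO` stands in for a real file so nothing is written to disk.

```python
import io
import pandas as pd

# Toy data, made up for this illustration
df = pd.DataFrame({"Animal": ["Dog", "Cat"], "Age": [7, 4]})

# Writers are DataFrame methods; called without a path they return text
csv_text = df.to_csv(index=False)
json_text = df.to_json()

# Readers are top-level pandas functions; StringIO lets them read from a string
from_csv = pd.read_csv(io.StringIO(csv_text))
from_json = pd.read_json(io.StringIO(json_text))

print(from_csv)
print(from_json)
```

The same pattern applies to the other formats: a `pd.read_*` function for loading and a matching `DataFrame.to_*` method for saving.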


In [2]:
import pandas as pd

## CSV

In [3]:
dat_file = 'http://ddc-datascience.s3-website-us-west-1.amazonaws.com/Home_Data.csv'
home_dat = pd.read_csv(dat_file)
home_dat.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [4]:
home_dat.shape

(1460, 81)

In [5]:
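# Writing (saving) data
home_dat.to_csv('home_dat2.csv') # In Colab: click the file icon on the left, then the three dots next to the file, and download.
home_dat.to_csv('/content/home_dat2.csv') # Same file when the working directory is /content (the Colab default)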
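
One detail worth knowing about `to_csv`: by default it also writes the row index as an unnamed first column, which `read_csv` then loads back as a column called `Unnamed: 0`. The sketch below uses a two-row toy frame (values borrowed from the first rows shown above) to compare the default with `index=False`.

```python
import io
import pandas as pd

# Small stand-in for home_dat, for illustration only
df = pd.DataFrame({"Id": [1, 2], "SalePrice": [208500, 181500]})

# Default: the index is written as an extra, unnamed first column
reread_default = pd.read_csv(io.StringIO(df.to_csv()))

# index=False: only the real columns are written
reread_no_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(reread_default.columns))
print(list(reread_no_index.columns))
```

If the index carries no information, passing `index=False` when saving keeps the file identical in shape to the original data.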

In [6]:
ls


home_dat2.csv  sample_data/


In [7]:
home_dat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

### Your Turn
1. Read in the `Wholesale_Data.csv` file from AWS at http://ddc-datascience.s3-website-us-west-1.amazonaws.com/Wholesale_Data.csv. Save it to a variable called `wholesale`.
2. Look at the first five rows of `wholesale`. Use the `describe()` method on your data frame. Create a histogram of the `Grocery` column of your data.

Extra: calculate the [coefficient of variation](https://en.wikipedia.org/wiki/Coefficient_of_variation), which is the standard deviation divided by the mean.
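
As a hint for the extra task, here is a sketch of the coefficient of variation on a made-up series (not the wholesale data). Note that `Series.std()` uses the sample standard deviation (`ddof=1`) by default; pass `ddof=0` for the population version.

```python
import pandas as pd

# Made-up numbers, chosen so the population result is easy to check by hand
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

# Coefficient of variation = standard deviation / mean
cv_sample = values.std() / values.mean()            # sample std (ddof=1), the pandas default
cv_population = values.std(ddof=0) / values.mean()  # population std (ddof=0)

print(cv_sample)
print(cv_population)
```

For this series the mean is 5 and the population standard deviation is 2, so the population coefficient of variation is 0.4.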

In [8]:
# Solution


In [9]:
# Solution


## Excel

In Excel, a file consists of a single workbook. Within a workbook there can be one or more sheets (sometimes called tabs).

In [10]:
# Location of Excel file
dat_file_url = 'http://ddc-datascience.s3-website-us-west-1.amazonaws.com/animals.xlsx'
dat_file_url

'http://ddc-datascience.s3-website-us-west-1.amazonaws.com/animals.xlsx'

In [11]:
# Reading in one sheet from a workbook
animal_dat = pd.read_excel( dat_file_url, sheet_name = "Sheet1")
animal_dat

Unnamed: 0,Animal,Age,Color
0,Dog,7,Brown
1,Cat,4,Black
2,Cow,3,Brown
3,Mouse,2,White


In [12]:
# Reading in multiple sheets from a workbook
workbook = pd.ExcelFile( dat_file_url )

# listing the worksheets
workbook.sheet_names


['Sheet1', 'Sheet2', 'Sheet3']

In [13]:
# Reading in a few sheets from a workbook
animals_dat1 = pd.read_excel( workbook, "Sheet1")
animals_dat2 = pd.read_excel( workbook, "Sheet2")
print(animals_dat1)
print()
print(animals_dat2)

  Animal  Age  Color
0    Dog    7  Brown
1    Cat    4  Black
2    Cow    3  Brown
3  Mouse    2  White

  Animal      Name   Sound
0    Dog  Precious    Woof
1    Cat  Midnight    Meow
2    Cow     Spots     Moo
3  Mouse     Fuzzy  Squeek


In [14]:
# Can also read in every sheet from a workbook using a dictionary comprehension
with pd.ExcelFile(dat_file_url) as workbook:
  data = {
    sheet_name: pd.read_excel( workbook, sheet_name)
      for sheet_name in workbook.sheet_names
  }

data.keys()


dict_keys(['Sheet1', 'Sheet2', 'Sheet3'])
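
As an alternative to the dictionary comprehension above, `pd.read_excel` accepts `sheet_name=None`, which returns a dictionary of all sheets in one call. The sketch below builds a small two-sheet workbook in memory so it runs without a download; the toy data and the in-memory `BytesIO` buffer are assumptions for illustration, and the `openpyxl` package is assumed to be installed.

```python
import io
import pandas as pd

# Toy sheets loosely modelled on animals.xlsx
sheets = {
    "Sheet1": pd.DataFrame({"Animal": ["Dog", "Cat"], "Age": [7, 4]}),
    "Sheet2": pd.DataFrame({"Animal": ["Dog", "Cat"], "Sound": ["Woof", "Meow"]}),
}

# Write several sheets into one workbook with ExcelWriter
buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    for name, frame in sheets.items():
        frame.to_excel(writer, sheet_name=name, index=False)

# sheet_name=None reads every sheet into a dict keyed by sheet name
buffer.seek(0)
all_sheets = pd.read_excel(buffer, sheet_name=None)

print(list(all_sheets.keys()))
print(all_sheets["Sheet2"])
```

With a real file, the buffer would simply be replaced by a path or URL in both the `ExcelWriter` and `read_excel` calls.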

In [15]:
data["Sheet1"]

Unnamed: 0,Animal,Age,Color
0,Dog,7,Brown
1,Cat,4,Black
2,Cow,3,Brown
3,Mouse,2,White


In [16]:
data['Sheet2']

Unnamed: 0,Animal,Name,Sound
0,Dog,Precious,Woof
1,Cat,Midnight,Meow
2,Cow,Spots,Moo
3,Mouse,Fuzzy,Squeek


In [17]:
data['Sheet3']

Unnamed: 0,Animal,Info
0,Dog,"[{""id"": 54, ""name"":""Precious""}]"
1,Cat,"[{""id"": 24, ""name"":""Midnight""}]"
2,Cow,"[{""id"": 32, ""name"":""Spots""}]"
3,Mouse,"[{""id"": 58, ""name"":""Fuzzy""}]"


### Your Turn
1. Read in Sheet 3 from the `animals.xlsx` file. Save it to a variable called `animals_info`.
2. Look at `animals_info`.
3. Add a column to `animals_info` called `weight` that gives the weight for each animal.
4. Save your updated `animals_info` data frame as `animals_updated.xlsx`.

In [18]:
# Solution
workbook = pd.ExcelFile( dat_file_url )
animals_info = pd.read_excel( workbook, "Sheet3")
animals_info.shape

(4, 2)

In [19]:
# Solution
animals_info

Unnamed: 0,Animal,Info
0,Dog,"[{""id"": 54, ""name"":""Precious""}]"
1,Cat,"[{""id"": 24, ""name"":""Midnight""}]"
2,Cow,"[{""id"": 32, ""name"":""Spots""}]"
3,Mouse,"[{""id"": 58, ""name"":""Fuzzy""}]"


In [20]:
# Solution
animals_info["weight"] = [ 50, 10, 400, 0.2 ]
animals_info

Unnamed: 0,Animal,Info,weight
0,Dog,"[{""id"": 54, ""name"":""Precious""}]",50.0
1,Cat,"[{""id"": 24, ""name"":""Midnight""}]",10.0
2,Cow,"[{""id"": 32, ""name"":""Spots""}]",400.0
3,Mouse,"[{""id"": 58, ""name"":""Fuzzy""}]",0.2


In [23]:
# Solution
animals_info.to_excel( "animals_updated.xlsx", index=False )

In [22]:
ls


animals_updated.xlsx  home_dat2.csv  sample_data/


## HTML
Reading in HTML can be useful if there is a table of data on a website that you want to parse.

In [None]:
# url = "https://en.wikipedia.org/wiki/New_Mexico"
url = "https://en.wikipedia.org/w/index.php?title=New_Mexico&oldid=1250070720"
nm_tables = pd.read_html(url)
len(nm_tables) # nm_tables is a list containing every table on this webpage

34

In [None]:
nm_tables[5]

Unnamed: 0_level_0,Largest cities or towns in New Mexico Source: 2017 U.S. Census Bureau estimate,Largest cities or towns in New Mexico Source: 2017 U.S. Census Bureau estimate,Largest cities or towns in New Mexico Source: 2017 U.S. Census Bureau estimate,Largest cities or towns in New Mexico Source: 2017 U.S. Census Bureau estimate,Largest cities or towns in New Mexico Source: 2017 U.S. Census Bureau estimate,Largest cities or towns in New Mexico Source: 2017 U.S. Census Bureau estimate,Largest cities or towns in New Mexico Source: 2017 U.S. Census Bureau estimate,Largest cities or towns in New Mexico Source: 2017 U.S. Census Bureau estimate,Largest cities or towns in New Mexico Source: 2017 U.S. Census Bureau estimate,Largest cities or towns in New Mexico Source: 2017 U.S. Census Bureau estimate
Unnamed: 0_level_1,Unnamed: 0_level_1,Rank,Name,County,Pop.,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
0,Albuquerque Las Cruces,1,Albuquerque,Bernalillo,558545.0,Rio Rancho Santa Fe,,,,
1,Albuquerque Las Cruces,2,Las Cruces,Doña Ana,101712.0,Rio Rancho Santa Fe,,,,
2,Albuquerque Las Cruces,3,Rio Rancho,Sandoval / Bernalillo,96159.0,Rio Rancho Santa Fe,,,,
3,Albuquerque Las Cruces,4,Santa Fe,Santa Fe,83776.0,Rio Rancho Santa Fe,,,,
4,Albuquerque Las Cruces,5,Roswell,Chaves,47775.0,Rio Rancho Santa Fe,,,,
5,Albuquerque Las Cruces,6,Farmington,San Juan,45450.0,Rio Rancho Santa Fe,,,,
6,Albuquerque Las Cruces,7,Clovis,Curry,38962.0,Rio Rancho Santa Fe,,,,
7,Albuquerque Las Cruces,8,Hobbs,Lea,37764.0,Rio Rancho Santa Fe,,,,
8,Albuquerque Las Cruces,9,Alamogordo,Otero,31248.0,Rio Rancho Santa Fe,,,,
9,Albuquerque Las Cruces,10,Carlsbad,Eddy,28774.0,Rio Rancho Santa Fe,,,,


In [None]:
my_df = nm_tables[5].iloc[0:10,1:5].copy()
my_df

Unnamed: 0_level_0,Largest cities or towns in New Mexico Source: 2017 U.S. Census Bureau estimate,Largest cities or towns in New Mexico Source: 2017 U.S. Census Bureau estimate,Largest cities or towns in New Mexico Source: 2017 U.S. Census Bureau estimate,Largest cities or towns in New Mexico Source: 2017 U.S. Census Bureau estimate
Unnamed: 0_level_1,Rank,Name,County,Pop.
0,1,Albuquerque,Bernalillo,558545.0
1,2,Las Cruces,Doña Ana,101712.0
2,3,Rio Rancho,Sandoval / Bernalillo,96159.0
3,4,Santa Fe,Santa Fe,83776.0
4,5,Roswell,Chaves,47775.0
5,6,Farmington,San Juan,45450.0
6,7,Clovis,Curry,38962.0
7,8,Hobbs,Lea,37764.0
8,9,Alamogordo,Otero,31248.0
9,10,Carlsbad,Eddy,28774.0
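
The table above still carries a two-level column header (the table title on top, the real column names below). One possible way to keep only the lower level is `droplevel`, sketched here on a two-row reconstruction of the table with a shortened title; the variable names are hypothetical.

```python
import pandas as pd

# Rebuild a small piece of the table with a two-level column header
columns = pd.MultiIndex.from_product(
    [["Largest cities or towns in New Mexico"], ["Rank", "Name", "County", "Pop."]]
)
cities = pd.DataFrame(
    [
        [1, "Albuquerque", "Bernalillo", 558545.0],
        [2, "Las Cruces", "Doña Ana", 101712.0],
    ],
    columns=columns,
)

# Drop the top header level so columns can be selected by their plain names
flat = cities.droplevel(0, axis=1)

print(flat.columns.tolist())
print(flat["Name"].tolist())
```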


In [None]:
for i, table in enumerate(nm_tables):
  print(f"Table index: {i}")
  print(table.columns)
  print()

Table index: 0
Index([0, 1], dtype='int64')

Table index: 1
Index(['New Mexico  Nuevo México (Spanish)Yootó Hahoodzo (Navajo)', 'New Mexico  Nuevo México (Spanish)Yootó Hahoodzo (Navajo).1'], dtype='object')

Table index: 2
MultiIndex([('List of state symbols',   'Living insignia'),
            ('List of state symbols', 'Living insignia.1')],
           )

Table index: 3
MultiIndex([('Climate data for New Mexico', 'Month'),
            ('Climate data for New Mexico',   'Jan'),
            ('Climate data for New Mexico',   'Feb'),
            ('Climate data for New Mexico',   'Mar'),
            ('Climate data for New Mexico',   'Apr'),
            ('Climate data for New Mexico',   'May'),
            ('Climate data for New Mexico',   'Jun'),
            ('Climate data for New Mexico',   'Jul'),
            ('Climate data for New Mexico',   'Aug'),
            ('Climate data for New Mexico',   'Sep'),
            ('Climate data for New Mexico',   'Oct'),
            ('Climate data for N

In [None]:
# To keep only the tables that contain certain text, use the match argument (the result is still a list)
nm_lang_tables = pd.read_html(url, match = 'English only')
len(nm_lang_tables)

1
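
The behavior of `match` can also be checked offline on a small HTML string. The snippet below is a sketch with made-up markup containing two tables; it assumes an HTML parser that pandas can use (such as `lxml`) is installed, and wraps the string in `io.StringIO` because newer pandas versions expect a file-like object rather than literal HTML.

```python
import io
import pandas as pd

# Made-up page with two tables, for illustration only
html = """
<table>
  <tr><th>Language</th><th>Share</th></tr>
  <tr><td>English only</td><td>64%</td></tr>
  <tr><td>Spanish</td><td>28%</td></tr>
</table>
<table>
  <tr><th>City</th><th>Rank</th></tr>
  <tr><td>Albuquerque</td><td>1</td></tr>
</table>
"""

# Without match, every table on the page is returned
all_tables = pd.read_html(io.StringIO(html))

# With match, only tables containing the given text are returned
matched = pd.read_html(io.StringIO(html), match="English only")

print(len(all_tables), len(matched))
print(matched[0])
```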

In [None]:
nm_lang_tables

[              0    1
 0  English only  64%
 1       Spanish  28%
 2        Navajo   4%
 3        Others   4%]

In [None]:
nm_lang_df = nm_lang_tables[0]
nm_lang_df.shape


(4, 2)

In [None]:
nm_lang_df

Unnamed: 0,0,1
0,English only,64%
1,Spanish,28%
2,Navajo,4%
3,Others,4%


In [None]:
nm_lang_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       4 non-null      object
 1   1       4 non-null      object
dtypes: object(2)
memory usage: 192.0+ bytes
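
Both columns come back as `object` (text), so the percentages cannot be used in calculations yet. A possible next step, sketched below on a reconstruction of the table, is to name the columns and convert the percentage strings to numbers. The column names `Language` and `SharePct` are hypothetical choices, not part of the source page.

```python
import pandas as pd

# Reconstruction of the language table shown above
lang_df = pd.DataFrame(
    {0: ["English only", "Spanish", "Navajo", "Others"], 1: ["64%", "28%", "4%", "4%"]}
)

# Give the columns descriptive names (hypothetical choices)
lang_df = lang_df.rename(columns={0: "Language", 1: "SharePct"})

# Strip the trailing percent sign and convert the text to floats
lang_df["SharePct"] = lang_df["SharePct"].str.rstrip("%").astype(float)

print(lang_df)
print(lang_df.dtypes)
```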
