# Intro to Pandas
by Ryan Orsinger

## Module 5: Working With Files
- More on using `pd.read_csv`
    - HTTP Requests
    - Working with files that use delimiters/separators other than commas
    - Setting the index column
- Writing data with `to_csv`
- Reading JSON
- Reading from Excel files
- Writing to Excel files

In [None]:
import pandas as pd

In [None]:
# read_csv can read CSV files hosted at a URL.
# Pandas sends the HTTP request for you!
url = "https://gist.githubusercontent.com/ryanorsinger/cc276eea59e8295204d1f581c8da509f/raw/2388559aef7a0700eb31e7604351364b16e99653/mall_customers.csv"
pd.read_csv(url).head()

In [None]:
# To set the index column, use the index_col argument
# If you notice a column that makes sense to use as the index, you'll need to specify it by name
pd.read_csv(url, index_col="customer_id").head()
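
The same idea can be tried without a network connection. This is a minimal sketch that assumes a small in-memory CSV; the `gender` and `age` columns and their values are made up for illustration. It shows that `index_col` accepts either a column name or a column position.

```python
import io

import pandas as pd

# Hypothetical stand-in for the hosted file; columns and values are assumptions
csv_text = """customer_id,gender,age
1,Male,19
2,Female,21
3,Female,20
"""

# Select the index column by name
by_name = pd.read_csv(io.StringIO(csv_text), index_col="customer_id")

# Select the same column by its position (0 is the first column)
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(by_name)
```

Using the name is usually easier to read later; the position is handy when the index column has no header.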

In [None]:
# The ! operator inside of Jupyter Notebooks or IPython issues a command to the terminal
# If you use Windows without the Windows Subsystem for Linux enabled, use !dir *.csv
!ls *.csv

In [None]:
!ls *sales*.csv

In [None]:
sales_files = !ls *sales*.csv
sales_files

In [None]:
# Programmatically Reading Multiple Files 
sales_data = []
for file in sales_files:
    df = pd.read_csv(file)
    sales_data.append(df)
    
sales_df = pd.concat(sales_data, ignore_index=True)
sales_df
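
The `!ls` approach only works inside Jupyter or IPython on systems that have `ls`. As a portable alternative, the standard library's `glob` module matches the same wildcard pattern from plain Python. This is a sketch under stated assumptions: the two small sales files, their names, and their contents are hypothetical and are created in a temporary folder so the example is self-contained.

```python
import glob
import os
import tempfile

import pandas as pd

# Create two small, hypothetical sales files in a temporary folder
folder = tempfile.mkdtemp()
pd.DataFrame({"item": ["apple", "pear"], "amount": [3, 5]}).to_csv(
    os.path.join(folder, "sales_jan.csv"), index=False
)
pd.DataFrame({"item": ["plum"], "amount": [7]}).to_csv(
    os.path.join(folder, "sales_feb.csv"), index=False
)

# glob expands the wildcard pattern; sorted() makes the file order predictable
sales_files = sorted(glob.glob(os.path.join(folder, "*sales*.csv")))

# Read each file, then stack the dataframes with a fresh index
sales_data = [pd.read_csv(file) for file in sales_files]
sales_df = pd.concat(sales_data, ignore_index=True)

print(sales_df)
```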

In [None]:
# It's common in the field to combine many different data sources into a single dataframe for cleaning/analysis
# By default, to_csv writes the index values to their own column in the file
sales_df.to_csv("all_sales.csv")

In [None]:
!ls *.csv

In [None]:
# Notice how the saved index comes back as an extra column named "Unnamed: 0"
pd.read_csv("all_sales.csv").head()

In [None]:
# Let's see an example where we avoid this complication by paying more attention to the index
# The index argument on to_csv takes a boolean and defaults to True
sales_df.to_csv("all_sales_clean.csv", index=False)

In [None]:
# Notice that read_csv generates a fresh default index and no extra column appears
pd.read_csv("all_sales_clean.csv")

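If the dataframe has a named index column instead of only the autogenerated index, the index is written under that name, and you can restore it when reading by passing the name to `index_col`. That avoids the unnamed column.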
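
The three index situations can be compared side by side. This is a minimal sketch that assumes a tiny made-up dataframe and a hypothetical `order_id` index; it writes CSV text in memory rather than to disk (`to_csv` returns a string when no path is given).

```python
import io

import pandas as pd

df = pd.DataFrame({"amount": [3, 5, 7]})

# Default index written out: it comes back as a column named "Unnamed: 0"
with_index = pd.read_csv(io.StringIO(df.to_csv()))

# index=False leaves the index out of the file entirely
without_index = pd.read_csv(io.StringIO(df.to_csv(index=False)))

# A named index (hypothetical order_id) survives the round trip
# when index_col points back at it
named = df.set_index(pd.Index([101, 102, 103], name="order_id"))
restored = pd.read_csv(io.StringIO(named.to_csv()), index_col="order_id")

print(list(with_index.columns))
print(list(without_index.columns))
print(restored)
```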

### Note on Separator Characters, called Delimiters
- CSV files use commas to separate values
- You may encounter files that use a delimiter character other than a comma
- Tab-separated files are common in logfiles and spreadsheet exports
- Sometimes, you may encounter a file extension of .tsv for tab-separated values
- You may encounter delimiters other than commas or tabs in plain text files.
- Use `pd.read_csv` for them (unless the file is JSON), and pass the appropriate delimiter character with the `sep` argument

In [None]:
# The "\t" character is how we specify a tab character
pd.read_csv("penguins_with_tabs.tsv", sep="\t").head()
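
The `sep` argument works the same way for any single-character delimiter. This is a small sketch assuming a pipe-delimited text; the column names and values are invented for illustration. It also shows what happens when the delimiter is left at its default comma.

```python
import io

import pandas as pd

# Hypothetical pipe-delimited data held in memory
pipe_text = (
    "species|island|body_mass_g\n"
    "Adelie|Torgersen|3750\n"
    "Gentoo|Biscoe|5000\n"
)

# Tell read_csv which character separates the values
penguins = pd.read_csv(io.StringIO(pipe_text), sep="|")

# With the default comma separator, each line lands in a single column
wrong = pd.read_csv(io.StringIO(pipe_text))

print(penguins.shape)
print(wrong.shape)
```

A dataframe with only one wide column is often the first hint that the file uses a different delimiter.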

In [None]:
# The read_json function can read JSON files from the file system or from URLs.
# This is particularly helpful when consuming data from a RESTful API that returns JSON
curie_quotes = pd.read_json("https://aphorisms.glitch.me/api/example")
curie_quotes
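
To see what `read_json` does with a typical API response without calling a live service, the JSON can be supplied from memory. This is a sketch assuming a list of records; the `author` and `quote` keys and their placeholder values are hypothetical and do not reflect what the URL above returns.

```python
import io

import pandas as pd

# Hypothetical JSON payload: a list of records with identical keys
json_text = """[
  {"author": "A", "quote": "first placeholder"},
  {"author": "B", "quote": "second placeholder"}
]"""

# Wrap the text in StringIO so read_json treats it as a file-like object
quotes = pd.read_json(io.StringIO(json_text))

print(quotes)
```

Each record becomes a row and each key becomes a column.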

## Example of using `read_clipboard`

| manufacturer |           model | displ | year |  cyl |    trans |  drv |  cty |  hwy |    fl | class   |
| --------: | ----------------: | ---: | ---: | ----: | ---------: | ---: | ---: | ---: | ----: | ------- |
|      audi |                a4 |  2.0 | 2008 |     4 |   auto(av) |    f |   21 |   30 |     p | compact |
|     dodge | dakota pickup 4wd |  3.9 | 1999 |     6 | manual(m5) |    4 |   14 |   17 |     r | pickup  |
|    toyota |       4runner 4wd |  4.7 | 2008 |     8 |   auto(l5) |    4 |   14 |   17 |     r | suv     |
|     dodge |       caravan 2wd |  3.8 | 2008 |     6 |   auto(l6) |    f |   16 |   23 |     r | minivan |
| chevrolet |            malibu |  3.6 | 2008 |     6 |   auto(s6) |    f |   17 |   26 |     r | midsize |


In [None]:
# Highlight and copy the table above 
# Then run this cell
df = pd.read_clipboard()
df

In [None]:
# Writing a dataframe in memory to an Excel file
# index=False leaves the autogenerated index out of the spreadsheet
df.to_excel("mpg.xlsx", index=False)

In [None]:
# Reading an excel file (simple version)
mpg = pd.read_excel("mpg.xlsx")

In [None]:
mpg

In [None]:
# Reading a specific sheet from an excel file
pd.read_excel("example_spreadsheet.xlsx", sheet_name="grocery_list")

In [None]:
# Notice how there are some extra rows above the actual table
pd.read_excel("example_spreadsheet.xlsx", sheet_name="pet_info")

In [None]:
# Sometimes, you may need to open the spreadsheet to identify how many rows to skip
pd.read_excel("example_spreadsheet.xlsx", sheet_name="pet_info", skiprows=4)
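
`pd.read_csv` also accepts `skiprows`, which makes the behavior easy to try without an Excel file. This is a minimal sketch assuming a small in-memory text with two lines of notes above the real header; the pet names and columns are invented and are not the contents of `example_spreadsheet.xlsx`.

```python
import io

import pandas as pd

# Hypothetical export: two note lines precede the real header row
messy_text = (
    "Pet records\n"
    "Exported by the front desk\n"
    "name,species,age\n"
    "Rex,dog,4\n"
    "Tom,cat,2\n"
)

# Skip the first two lines so the third line is used as the header
pets = pd.read_csv(io.StringIO(messy_text), skiprows=2)

print(pets)
```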

## Additional Resources
- https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
- https://pandas.pydata.org/docs/reference/api/pandas.read_clipboard.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_clipboard.html
- https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_excel.html
- Other formats https://pandas.pydata.org/docs/user_guide/io.html
    - SQL
    - XML
    - STATA
    - SAS
    - SPSS