# Pandas Tutorial 4: Reading and Writing CSV and Excel Files in Pandas

In this tutorial, we'll focus on reading from and writing to CSV and Excel files, both essential tasks in data analysis. You'll learn how to handle messy data, transform it during import, and export portions of a DataFrame to Excel.

**Topics covered:**
- Reading a CSV with `read_csv()`
- Skipping rows using `skiprows`
- Importing a CSV with missing headers
- Reading a limited number of rows
- Cleaning messy data using `na_values`
- Replacing values with `na_values` (dictionary)
- Writing to CSV with `to_csv()`
- Reading an Excel file with `read_excel()`
- Using `converters` in `read_excel()`
- Writing to Excel with `to_excel()`
- Using `ExcelWriter()` for advanced Excel operations

This tutorial will build on your DataFrame knowledge and introduce real-world data handling with CSV and Excel files, a crucial part of any data analysis workflow.

In [1]:
import pandas as pd

### Reading a CSV File with Limited Rows Using `nrows`

The `nrows` argument in `read_csv()` loads only the first N data rows of a CSV file (the header line is not counted). For example, `nrows=3` reads just the first three rows, which is useful when the dataset is large and you only need a sample.

**Key features of `nrows`**:
- Reduces memory use and load time, because only the requested rows are parsed.
- Ideal for quick inspection or testing code on a small subset before processing the entire dataset.

In [2]:
# Read only the first 3 rows from the CSV file
df = pd.read_csv("C:\\Users\\Vaishob\\PycharmProjects\\pandas\\stock_data.csv", nrows=3)
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845,larry page
1,WMT,4.61,484,65,n.a.
2,MSFT,-1.0,85,64,bill gates
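
To try `nrows` without the file on disk, the sketch below reads from an in-memory string. The `csv_text` content is a hypothetical stand-in for `stock_data.csv`, reconstructed from the outputs shown in this tutorial, and `sample` is an illustrative name.

```python
import io

import pandas as pd

# Hypothetical stand-in for stock_data.csv (values taken from the outputs in this tutorial)
csv_text = """tickers,eps,revenue,price,people
GOOGL,27.82,87,845,larry page
WMT,4.61,484,65,n.a.
MSFT,-1,85,64,bill gates
RIL,not available,50,1023,mukesh ambani
TATA,5.6,-1,n.a.,ratan tata
"""

# The header line is always read; nrows=3 limits the data rows that follow it
sample = pd.read_csv(io.StringIO(csv_text), nrows=3)

print(sample.shape)               # (3, 5)
print(sample["tickers"].tolist()) # ['GOOGL', 'WMT', 'MSFT']
```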


### Handling Missing Data with `na_values` in `read_csv()` 

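The `na_values` argument in `read_csv()` lets you list custom strings or values that should be treated as missing (`NaN`). Matching is exact: in the call below, "not available" is converted to `NaN`, but "n.a" (without the trailing period) does not match the "n.a." cells in the CSV, so those cells remain in the output. Pass "n.a." to convert them as well.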

**Key features of `na_values`**:
- Allows defining custom representations of missing data.
- Ensures consistent handling of missing or invalid data in the DataFrame.

In [3]:
# Treats "not available" and "n.a" as NaN; the file's "n.a." cells (with a trailing period) do not match and stay as-is
df = pd.read_csv("C:\\Users\\Vaishob\\PycharmProjects\\pandas\\stock_data.csv", na_values=["not available","n.a"])
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845,larry page
1,WMT,4.61,484,65,n.a.
2,MSFT,-1.0,85,64,bill gates
3,RIL,,50,1023,mukesh ambani
4,TATA,5.6,-1,n.a.,ratan tata
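
Since `na_values` strings are compared against the whole cell text, "n.a" and "n.a." are different markers, which is why the output above still contains `n.a.`. A minimal sketch of the difference, using a small hypothetical in-memory CSV rather than the real file:

```python
import io

import pandas as pd

# Hypothetical two-row excerpt in the style of stock_data.csv
csv_text = """tickers,eps,people
WMT,4.61,n.a.
RIL,not available,mukesh ambani
"""

# "n.a" (no trailing period) does not match the cell text "n.a."
loose = pd.read_csv(io.StringIO(csv_text), na_values=["not available", "n.a"])

# "n.a." matches the cell text exactly
exact = pd.read_csv(io.StringIO(csv_text), na_values=["not available", "n.a."])

print(loose["people"].isna().tolist())  # [False, False]
print(exact["people"].isna().tolist())  # [True, False]
print(exact["eps"].isna().tolist())     # [False, True]
```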


### Using a Dictionary with `na_values` in `read_csv()`

The `na_values` argument can accept a dictionary, allowing you to define custom missing value markers for each column. For example, the `eps` column treats "not available" and "n.a." as `NaN`, while the `revenue` column also treats `-1` as missing.

**Key Pointers**:
- Offers precise control over missing data for specific columns.
- Ideal for handling columns with different missing data representations.

In [4]:
df = pd.read_csv("stock_data.csv", na_values={
    'eps': ["not available","n.a."], # Specifies "not available" and "n.a." as NaN for the 'eps' column
    'revenue': ["not available","n.a.",-1], # Specifies "not available", "n.a.", and -1 as NaN for the 'revenue' column
})
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87.0,845,larry page
1,WMT,4.61,484.0,65,n.a.
2,MSFT,-1.0,85.0,64,bill gates
3,RIL,,50.0,1023,mukesh ambani
4,TATA,5.6,,n.a.,ratan tata
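
The per-column behaviour can be checked on a small example: `-1` is a legitimate value in `eps` but a missing-data marker in `revenue`. The sketch below uses a hypothetical in-memory CSV, and `stocks` is an illustrative name.

```python
import io

import pandas as pd

# Hypothetical excerpt in the style of stock_data.csv
csv_text = """tickers,eps,revenue
MSFT,-1,85
RIL,not available,50
TATA,5.6,-1
"""

stocks = pd.read_csv(
    io.StringIO(csv_text),
    na_values={
        "eps": ["not available"],  # -1 stays a real value in this column
        "revenue": [-1],           # -1 means "missing" only in this column
    },
)

print(stocks["eps"].isna().tolist())      # [False, True, False]
print(stocks["revenue"].isna().tolist())  # [False, False, True]
print(stocks["eps"].iloc[0])              # -1.0
```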


### Writing a DataFrame to a CSV File with `to_csv()`

The `to_csv()` method exports a DataFrame to a CSV file. Using `index=False` ensures the index is not written as a separate column.

**Key features**:
- Easily saves a DataFrame to a CSV file.
- `index=False` prevents the index from being written, useful when it's not needed.
- Customize the output with options like delimiter and more.

In [6]:
# Writes the DataFrame df to a CSV file named "new.csv" at the specified file path, without saving the index as a column
df.to_csv('C:\\Users\\Vaishob\\PycharmProjects\\pandas\\new.csv', index=False)
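
To see what `index=False` changes without opening the file, you can call `to_csv()` with no path, in which case it returns the CSV text as a string. The sketch below uses a small hypothetical DataFrame (`df_demo`) and also shows the `sep` option for a different delimiter.

```python
import pandas as pd

# Hypothetical two-row DataFrame for illustration
df_demo = pd.DataFrame({"tickers": ["GOOGL", "WMT"], "eps": [27.82, 4.61]})

# With no path argument, to_csv() returns the CSV text instead of writing a file
with_index = df_demo.to_csv()
without_index = df_demo.to_csv(index=False)
semicolons = df_demo.to_csv(index=False, sep=";")

print(with_index.splitlines()[0])     # ,tickers,eps
print(without_index.splitlines()[0])  # tickers,eps
print(semicolons.splitlines()[0])     # tickers;eps
```

The leading comma in the first header line is the unnamed index column that `index=False` suppresses.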

In [7]:
# Returns the column labels of the DataFrame df as an Index object
df.columns

Index(['tickers', 'eps', 'revenue', 'price', 'people'], dtype='object')

### Exporting Specific Columns with `to_csv()`

The `to_csv()` method allows you to export specific columns using the `columns` argument. For example, only the `'tickers'` and `'eps'` columns are written to the CSV (plus the index, unless `index=False` is also passed).

**Key features**:
- Enables selective export of columns.
- Keeps the output file small by exporting only the needed data.

In [10]:
# Writes only the 'tickers' and 'eps' columns of the DataFrame df to CSV file
df.to_csv("C:\\Users\\Vaishob\\PycharmProjects\\pandas\\new.csv",columns=['tickers','eps'])
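
A quick way to check which columns end up in the file is to let `to_csv()` return a string. This is a sketch with a hypothetical DataFrame (`df_demo`); it also shows that the index is still written unless `index=False` is passed alongside `columns`.

```python
import pandas as pd

# Hypothetical DataFrame with one column we do not want to export
df_demo = pd.DataFrame({
    "tickers": ["GOOGL", "WMT"],
    "eps": [27.82, 4.61],
    "price": [845, 65],
})

# Same style of call as in the tutorial: selected columns, index still included
default_call = df_demo.to_csv(columns=["tickers", "eps"])

# Selected columns only, no index column
subset = df_demo.to_csv(columns=["tickers", "eps"], index=False)

print(default_call.splitlines()[0])  # ,tickers,eps
print(subset.splitlines())           # ['tickers,eps', 'GOOGL,27.82', 'WMT,4.61']
```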

### Exporting a DataFrame to CSV Without Column Headers

Using `header=False` in the `to_csv()` method prevents column headers from being written to the CSV file, which is useful for appending data or when headers aren't needed.

**Key features**:
- Provides flexibility when exporting data without headers.
- Ideal for appending to an existing file (together with `mode='a'`) without duplicating column names.

In [12]:
# Exports the DataFrame without column headers
df.to_csv("C:\\Users\\Vaishob\\PycharmProjects\\pandas\\new.csv",header=False)
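
The appending use case needs one more argument: `mode="a"` opens the file in append mode, and `header=False` keeps the column names from being written a second time. A minimal sketch, assuming a temporary directory and hypothetical DataFrames (`first`, `more`):

```python
import os
import tempfile

import pandas as pd

# Hypothetical data written in two batches
first = pd.DataFrame({"tickers": ["GOOGL"], "eps": [27.82]})
more = pd.DataFrame({"tickers": ["WMT"], "eps": [4.61]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "new.csv")  # hypothetical file name

    # First write: header line plus one data row
    first.to_csv(path, index=False)

    # Second write: append one data row, no second header line
    more.to_csv(path, mode="a", header=False, index=False)

    combined = pd.read_csv(path)

print(combined["tickers"].tolist())  # ['GOOGL', 'WMT']
```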

### Using the `converters` Argument in `read_excel()`

The `converters` argument in `read_excel()` applies custom functions to specific columns while reading an Excel file. For example:
* `convert_people_cell` replaces `"n.a."` with `'sam walton'` in the 'people' column.
* `convert_eps_cell` replaces `"not available"` with `None` in the 'eps' column.

**Key features**:
- Offers flexibility in transforming column values during import.
- Allows custom handling of missing or invalid data when loading the file.

In [15]:
# Replace "n.a." with 'sam walton' in the 'people' column
def convert_people_cell(cell):
    if cell=="n.a.":
        return 'sam walton'
    return cell

# Replace "not available" with None in the 'eps' column
def convert_eps_cell(cell):
    if cell=="not available":
        return None
    return cell

# Reads "Sheet1" of "stock_data.xlsx" and applies custom converters to the 'people' and 'eps' columns
df = pd.read_excel("C:\\Users\\Vaishob\\PycharmProjects\\pandas\\stock_data.xlsx", sheet_name="Sheet1", converters={
    'people': convert_people_cell,
    'eps': convert_eps_cell
})
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845,larry page
1,WMT,4.61,484,65,sam walton
2,MSFT,-1,85,64,bill gates
3,RIL,not available,50,1023,mukesh ambani
4,TATA,5.6,-1,n.a.,ratan tata
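
A converter is called once per cell and only replaces cells that compare equal to the marker, so a cell whose text differs slightly (for example, through stray whitespace) passes through unchanged. The sketch below shows a more tolerant variant of `convert_eps_cell`; it is a hypothetical alternative, not the tutorial's original function, and it is demonstrated with `read_csv()`, which accepts the same `converters` argument, so that it runs without an Excel file.

```python
import io

import pandas as pd


def convert_eps_cell(cell):
    # Hypothetical variant: ignore surrounding whitespace before comparing.
    # The isinstance check matters for Excel, where numeric cells arrive as numbers.
    if isinstance(cell, str) and cell.strip() == "not available":
        return None
    return cell


# Hypothetical CSV with a trailing space after "not available"
csv_text = "tickers,eps\nGOOGL,27.82\nRIL,not available \n"

converted = pd.read_csv(io.StringIO(csv_text), converters={"eps": convert_eps_cell})

print(converted["eps"].isna().tolist())  # [False, True]
```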


### Writing a DataFrame to Excel with Custom Options

The `to_excel()` method exports a DataFrame to an Excel file with custom options:
* `sheet_name="stocks"`: Sets the sheet name.
* `startrow=1`, `startcol=2`: Begins writing at the second row and third column, i.e. cell C2 (both arguments are 0-based).
* `index=False`: Excludes the index from the output.

**Key features**:
- Enables flexible placement of data in the Excel sheet.
- Useful for leaving space for titles, notes, or other content above or beside the table.

In [17]:
# Writes the DataFrame to the "stocks" sheet of "new.xlsx" with its top-left cell at C2 (startrow and startcol are 0-based), without the index
df.to_excel("C:\\Users\\Vaishob\\PycharmProjects\\pandas\\new.xlsx", sheet_name="stocks", startrow=1, startcol=2, index=False)
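
To confirm where the table lands, you can open the written workbook and inspect individual cells. The sketch below assumes `openpyxl` is installed (it is used here both as the pandas Excel engine and directly for the check); `df_demo` and the temporary file are hypothetical.

```python
import os
import tempfile

import pandas as pd
from openpyxl import load_workbook  # assumption: openpyxl is installed

# Hypothetical two-row DataFrame for illustration
df_demo = pd.DataFrame({"tickers": ["GOOGL", "WMT"], "eps": [27.82, 4.61]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "new.xlsx")  # hypothetical file name
    df_demo.to_excel(path, sheet_name="stocks", startrow=1, startcol=2,
                     index=False, engine="openpyxl")

    workbook = load_workbook(path)
    sheet = workbook["stocks"]
    top_left = sheet["A1"].value      # nothing is written here
    header_cell = sheet["C2"].value   # first column header
    first_value = sheet["C3"].value   # first data row
    workbook.close()

print(top_left, header_cell, first_value)  # None tickers GOOGL
```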

### Creating Multiple DataFrames

In this example:
- `df_stocks` contains stock data (tickers, prices, P/E ratios, and EPS).
- `df_weather` contains weather data (date, temperature, and event).

**Key features**:
- Managing multiple DataFrames allows separate analysis of different datasets.
- Ideal for holding diverse data types (e.g. numerical, categorical, dates) in tabular form.

In [18]:
df_stocks = pd.DataFrame({
    'tickers': ['GOOGL', 'WMT', 'MSFT'],
    'price': [845, 65, 64],
    'pe': [30.37, 14.26, 30.97],
    'eps': [27.82, 4.61, 2.12]
})

df_weather = pd.DataFrame({
    'day': ['1/1/2017','1/2/2017','1/3/2017'],
    'temperature': [32,35,28],
    'event': ['Rain', 'Sunny', 'Snow']
})

### Writing Multiple DataFrames to an Excel File

Using `ExcelWriter()`, you can write multiple DataFrames to different sheets in the same Excel file:
* `df_stocks` is written to the "stocks" sheet.
* `df_weather` is written to the "weather" sheet.

**Key features**:
- Allows saving multiple DataFrames to one Excel file with separate sheet names.
- The `with` context manager ensures the file is properly closed after writing.

In [20]:
# Opens an ExcelWriter object to write to the file "stocks_weather.xlsx"
with pd.ExcelWriter('C:\\Users\\Vaishob\\PycharmProjects\\pandas\\stocks_weather.xlsx') as writer:
    df_stocks.to_excel(writer, sheet_name="stocks") # Writes the df_stocks DataFrame to the "stocks" sheet
    df_weather.to_excel(writer, sheet_name="weather") # Writes the df_weather DataFrame to the "weather" sheet
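
To verify the result, the workbook can be read back in one call: `read_excel()` with `sheet_name=None` returns a dictionary that maps each sheet name to a DataFrame. This is a sketch under a few assumptions: `openpyxl` is installed, the file goes to a temporary directory, and the two DataFrames are shortened, hypothetical versions of the ones above (written with `index=False` so the round trip returns the same columns).

```python
import os
import tempfile

import pandas as pd

# Shortened, hypothetical versions of df_stocks and df_weather
df_stocks = pd.DataFrame({"tickers": ["GOOGL", "WMT"], "price": [845, 65]})
df_weather = pd.DataFrame({"day": ["1/1/2017", "1/2/2017"], "event": ["Rain", "Sunny"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stocks_weather.xlsx")  # hypothetical file name

    # One writer, two sheets; the file is saved when the with block exits
    with pd.ExcelWriter(path, engine="openpyxl") as writer:
        df_stocks.to_excel(writer, sheet_name="stocks", index=False)
        df_weather.to_excel(writer, sheet_name="weather", index=False)

    # sheet_name=None loads every sheet into a dict keyed by sheet name
    sheets = pd.read_excel(path, sheet_name=None)

print(list(sheets))                       # ['stocks', 'weather']
print(sheets["weather"]["event"].tolist())  # ['Rain', 'Sunny']
```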