# Pandas Tutorial 4: Reading and Writing CSV and Excel Files in Pandas

Building on the previous tutorial where we explored various ways to create a DataFrame, we now turn our attention to one of the most common tasks in data analysis: reading from and writing to CSV and Excel files. In this tutorial, we will cover how to effectively manage these file formats using Pandas. This includes handling messy data during import, transforming data using converters, and selectively exporting portions of a DataFrame to an Excel file.

**Topics covered:**
- Introduction
- Reading a CSV file using the `read_csv()` method
- Skipping rows in a DataFrame using the `skiprows` argument
- Importing data from a CSV file with a "null header"
- Reading a limited number of rows from a CSV file
- Cleaning messy data (e.g., replacing "not available" and "n.a." with `na_values`)
- Replacing values using a dictionary with the `na_values` argument
- Writing a DataFrame to a CSV file using the `to_csv()` method
- Reading an Excel file using the `read_excel()` method
- Using the `converters` argument in the `read_excel()` method
- Writing a DataFrame to an Excel file using the `to_excel()` method
- Using the `ExcelWriter()` class for more advanced Excel file operations
- Overview of all properties for reading and writing Excel and CSV files

This tutorial will extend your knowledge from creating DataFrames to efficiently handling real-world data stored in CSV and Excel files, which is a key part of any data science or analytics workflow.

### Reading a CSV File with Limited Rows Using `nrows`

The `read_csv()` method allows you to specify how many rows to read from a CSV file using the `nrows` argument. In this case, only the first 3 rows are loaded into the DataFrame, which is useful when working with large datasets where you only need a sample of the data.

**Key features of `nrows`**:
- It helps optimize memory usage by limiting the number of rows read from a file.
- Useful for quick inspection or when testing code on a subset of data before processing the entire dataset.

In [2]:
import pandas as pd
# Read only the first 3 rows from the CSV file
df = pd.read_csv("C:\\Users\\Vaishob\\PycharmProjects\\pandas\\stock_data.csv", nrows=3)
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845,larry page
1,WMT,4.61,484,65,n.a.
2,MSFT,-1.0,85,64,bill gates
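
A minimal sketch of the effect of `nrows`, assuming an in-memory string so that it runs without the file on disk. `CSV_TEXT` is a hypothetical stand-in for `stock_data.csv`, built from this tutorial's sample rows:

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for stock_data.csv (sample rows from this tutorial)
CSV_TEXT = """tickers,eps,revenue,price,people
GOOGL,27.82,87,845,larry page
WMT,4.61,484,65,n.a.
MSFT,-1,85,64,bill gates
RIL,not available,50,1023,mukesh ambani
TATA,5.6,-1,n.a.,ratan tata
"""

# nrows counts data rows; the header line is not part of the count
sample = pd.read_csv(io.StringIO(CSV_TEXT), nrows=3)
full = pd.read_csv(io.StringIO(CSV_TEXT))

print(sample.shape)                # (3, 5)
print(full.shape)                  # (5, 5)
print(sample["tickers"].tolist())  # ['GOOGL', 'WMT', 'MSFT']
```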


### Handling Missing Data with `na_values` in `read_csv()` 

The `na_values` argument in the `read_csv()` method specifies additional strings or values that should be treated as missing data (`NaN`) while reading the file. In this case, the strings "not available" and "n.a." in the CSV file are converted to `NaN` in the DataFrame.

**Key features of `na_values`**:
- It allows you to define custom representations of missing data beyond the default ones.
- Helps ensure consistent handling of missing or invalid data in your DataFrame.

In [3]:
# Converts "not available" and "n.a." to NaN in the DataFrame
df = pd.read_csv("C:\\Users\\Vaishob\\PycharmProjects\\pandas\\stock_data.csv", na_values=["not available","n.a."])
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845.0,larry page
1,WMT,4.61,484,65.0,
2,MSFT,-1.0,85,64.0,bill gates
3,RIL,,50,1023.0,mukesh ambani
4,TATA,5.6,-1,,ratan tata
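
The sketch below tries the list form of `na_values` on an in-memory copy of the sample rows (`CSV_TEXT` is a hypothetical stand-in for `stock_data.csv`). It also suggests that a marker has to match the cell text exactly: `"n.a"` without the trailing period leaves `"n.a."` cells untouched.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for stock_data.csv (sample rows from this tutorial)
CSV_TEXT = """tickers,eps,revenue,price,people
GOOGL,27.82,87,845,larry page
WMT,4.61,484,65,n.a.
MSFT,-1,85,64,bill gates
RIL,not available,50,1023,mukesh ambani
TATA,5.6,-1,n.a.,ratan tata
"""

# Markers spelled exactly as they appear in the file
clean = pd.read_csv(io.StringIO(CSV_TEXT), na_values=["not available", "n.a."])
# Same call, but "n.a" is missing its trailing period
typo = pd.read_csv(io.StringIO(CSV_TEXT), na_values=["not available", "n.a"])

# Count missing cells in the three affected columns
cols = ["eps", "price", "people"]
print(clean[cols].isna().sum().tolist())  # [1, 1, 1]
print(typo[cols].isna().sum().tolist())   # [1, 0, 0]
```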


### Using a Dictionary with `na_values` in `read_csv()`

The `na_values` argument can also accept a dictionary, allowing you to specify different missing value markers for each column. In this example, the `eps` column treats "not available" and "n.a." as missing (`NaN`), while the `revenue` column treats "not available", "n.a.", and `-1` as missing.

**Key Pointers about `na_values`**:
- Provides fine-grained control over how missing data is handled for each specific column.
- Useful when different columns have different representations of missing or invalid data.

In [4]:
df = pd.read_csv("stock_data.csv", na_values={
    'eps': ["not available","n.a."], # Specifies "not available" and "n.a." as NaN for the 'eps' column
    'revenue': ["not available","n.a.",-1], # Specifies "not available", "n.a.", and -1 as NaN for the 'revenue' column
})
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87.0,845,larry page
1,WMT,4.61,484.0,65,n.a.
2,MSFT,-1.0,85.0,64,bill gates
3,RIL,,50.0,1023,mukesh ambani
4,TATA,5.6,,n.a.,ratan tata
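
As a rough check of the per-column behaviour, the following sketch applies the same dictionary to an in-memory copy of the sample rows (`CSV_TEXT` is a hypothetical stand-in for `stock_data.csv`):

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for stock_data.csv (sample rows from this tutorial)
CSV_TEXT = """tickers,eps,revenue,price,people
GOOGL,27.82,87,845,larry page
WMT,4.61,484,65,n.a.
MSFT,-1,85,64,bill gates
RIL,not available,50,1023,mukesh ambani
TATA,5.6,-1,n.a.,ratan tata
"""

df = pd.read_csv(io.StringIO(CSV_TEXT), na_values={
    "eps": ["not available", "n.a."],
    "revenue": ["not available", "n.a.", -1],
})

print(df["eps"].isna().tolist())      # [False, False, False, True, False]
print(df["revenue"].isna().tolist())  # [False, False, False, False, True]

# -1 is a missing-value marker only for 'revenue', so it survives in 'eps'
print(df.loc[2, "eps"])               # -1.0

# 'price' and 'people' have no entry in the dictionary, so "n.a." stays a string
print(df.loc[4, "price"], df.loc[1, "people"])  # n.a. n.a.
```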


### Writing a DataFrame to a CSV File with `to_csv()`

The `to_csv()` method in Pandas is used to export a DataFrame to a CSV file. In this case, the `index=False` argument ensures that the index (row labels) is not written as a separate column in the CSV file.

**Key features of `to_csv()`**:
- Allows you to easily save the contents of a DataFrame to a CSV file.
- The `index=False` argument prevents the index from being written, which is helpful when the index is not relevant for the CSV.
- You can customize the delimiter and other output options with additional arguments.

In [6]:
# Writes the DataFrame df to a CSV file named "new.csv" at the specified file path, without saving the index as a column
df.to_csv('C:\\Users\\Vaishob\\PycharmProjects\\pandas\\new.csv', index=False)

In [7]:
# Returns the column labels of the DataFrame df as an Index object
df.columns

Index(['tickers', 'eps', 'revenue', 'price', 'people'], dtype='object')
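
To see what `index=False` changes, the sketch below calls `to_csv()` without a path, in which case it returns the CSV text as a string instead of writing a file. The small DataFrame is a made-up subset of the sample data.

```python
import io
import pandas as pd

# Made-up subset of the tutorial's sample data
df = pd.DataFrame({
    "tickers": ["GOOGL", "WMT", "MSFT"],
    "eps": [27.82, 4.61, -1.0],
})

# With no path argument, to_csv() returns the CSV text as a string
with_index = df.to_csv()
without_index = df.to_csv(index=False)

print(with_index.splitlines()[0])     # ,tickers,eps
print(without_index.splitlines()[0])  # tickers,eps

# Reading the indexed version back yields an extra "Unnamed: 0" column
roundtrip = pd.read_csv(io.StringIO(with_index))
print(list(roundtrip.columns))        # ['Unnamed: 0', 'tickers', 'eps']
```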

### Exporting Specific Columns with `to_csv()`

The `to_csv()` method allows you to specify which columns to export by using the `columns` argument. In this case, only the 'tickers' and 'eps' columns are written to the CSV file, ignoring all other columns in the DataFrame.

**Key features**:
- Allows for selective export of specific columns from the DataFrame.
- Produces a smaller output file when only a portion of the DataFrame is needed.

In [10]:
# Writes only the 'tickers' and 'eps' columns of the DataFrame df to the CSV file (the index is still written, since index=False is not passed)
df.to_csv("C:\\Users\\Vaishob\\PycharmProjects\\pandas\\new.csv", columns=['tickers', 'eps'])
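
A small sketch of the `columns` argument on an in-memory DataFrame (a made-up subset of the sample data). It writes to a string rather than a file, and suggests that the columns are written in the order given:

```python
import pandas as pd

# Made-up subset of the tutorial's sample data
df = pd.DataFrame({
    "tickers": ["GOOGL", "WMT"],
    "eps": [27.82, 4.61],
    "price": [845, 65],
})

# Only the listed columns are exported; 'price' is left out
subset = df.to_csv(columns=["tickers", "eps"], index=False)
print(subset.splitlines())     # ['tickers,eps', 'GOOGL,27.82', 'WMT,4.61']

# The order of the list determines the order in the output
reordered = df.to_csv(columns=["eps", "tickers"], index=False)
print(reordered.splitlines())  # ['eps,tickers', '27.82,GOOGL', '4.61,WMT']
```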

### Exporting a DataFrame to CSV Without Column Headers

The `header=False` argument in the `to_csv()` method prevents the column headers from being written to the CSV file. This is useful when you want to save the data without including the column names, such as when appending to an existing file or when headers are not needed.

**Key features**:
- Allows for more flexibility when exporting data, especially in scenarios where headers are not required.
- Useful for appending data to existing CSV files without repeating the column names. Appending also requires `mode='a'`, since the default mode overwrites the file.

In [12]:
# Exports the DataFrame without column headers
df.to_csv("C:\\Users\\Vaishob\\PycharmProjects\\pandas\\new.csv",header=False)
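
Appending without repeating the header could look like the following sketch. It assumes a temporary directory and two made-up DataFrames, and combines `header=False` with `mode="a"` so the second write adds rows instead of replacing the file:

```python
import os
import tempfile
import pandas as pd

# Two made-up batches of rows with the same columns
first = pd.DataFrame({"tickers": ["GOOGL", "WMT"], "eps": [27.82, 4.61]})
second = pd.DataFrame({"tickers": ["MSFT"], "eps": [-1.0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "new.csv")
    # First write creates the file and includes the header row
    first.to_csv(path, index=False)
    # Second write appends rows only: no header, append mode
    second.to_csv(path, mode="a", header=False, index=False)
    combined = pd.read_csv(path)

print(combined["tickers"].tolist())  # ['GOOGL', 'WMT', 'MSFT']
print(combined.shape)                # (3, 2)
```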

### Using the `converters` Argument in `read_excel()`

The `converters` argument in the `read_excel()` method allows you to apply custom functions to specific columns while reading data from an Excel file. In this example:
- The `convert_people_cell` function replaces occurrences of `"n.a."` with `'sam walton'` in the 'people' column.
- The `convert_eps_cell` function replaces `"not available"` with `None` in the 'eps' column.

**Key features**:
- Provides flexibility in transforming specific column values during the data import process.
- Allows for custom handling of missing or invalid data directly when loading the Excel file.

In [15]:
# Replace "n.a." with 'sam walton' in the 'people' column
def convert_people_cell(cell):
    if cell=="n.a.":
        return 'sam walton'
    return cell

# Replace "not available" with None in the 'eps' column
def convert_eps_cell(cell):
    if cell=="not available":
        return None
    return cell

# Reads "Sheet1" of the Excel file "stock_data.xlsx" and applies custom converters to the 'people' and 'eps' columns
df = pd.read_excel("C:\\Users\\Vaishob\\PycharmProjects\\pandas\\stock_data.xlsx", "Sheet1", converters={
    'people': convert_people_cell,
    'eps': convert_eps_cell
})
df

Unnamed: 0,tickers,eps,revenue,price,people
0,GOOGL,27.82,87,845,larry page
1,WMT,4.61,484,65,sam walton
2,MSFT,-1.0,85,64,bill gates
3,RIL,,50,1023,mukesh ambani
4,TATA,5.6,-1,n.a.,ratan tata
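
`read_csv()` accepts a `converters` dictionary with the same per-column meaning, which makes it possible to sketch the idea without an Excel file. One difference to keep in mind: in `read_csv()` a converter receives the raw cell text, so this variant of `convert_eps_cell` parses the remaining values as `float` itself. `CSV_TEXT` is a hypothetical stand-in for the sample data.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the sample data
CSV_TEXT = """tickers,eps,revenue,price,people
GOOGL,27.82,87,845,larry page
WMT,4.61,484,65,n.a.
MSFT,-1,85,64,bill gates
RIL,not available,50,1023,mukesh ambani
TATA,5.6,-1,n.a.,ratan tata
"""

def convert_people_cell(cell):
    # Replace the "n.a." marker with a concrete name
    return "sam walton" if cell == "n.a." else cell

def convert_eps_cell(cell):
    # Map the "not available" marker to None; parse everything else as a number
    return None if cell == "not available" else float(cell)

df = pd.read_csv(io.StringIO(CSV_TEXT), converters={
    "people": convert_people_cell,
    "eps": convert_eps_cell,
})

print(df.loc[1, "people"])        # sam walton
print(pd.isna(df.loc[3, "eps"]))  # True
# 'price' has no converter, so its "n.a." cell is left as text
print(df.loc[4, "price"])         # n.a.
```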


### Writing a DataFrame to Excel with Custom Options

The `to_excel()` method is used to export a DataFrame to an Excel file. The options used here:
- `sheet_name="stocks"`: Specifies the name of the sheet where the data will be written.
- `startrow=1`: Starts writing the DataFrame at the second row (Excel is 1-based, but Pandas uses 0-based indexing).
- `startcol=2`: Starts writing at the third column (also 0-based indexing).
- `index=False`: Prevents the index from being written to the file.

**Key features**:
- Allows for flexible placement of the DataFrame within the Excel sheet.
- Useful when you want to leave space at the top or left of the sheet for titles, notes, or additional content.

In [17]:
# Writes the DataFrame to the "stocks" sheet of "new.xlsx" with its top-left cell at Excel row 2, column C (startrow=1 and startcol=2 are 0-based), and does not include the index
df.to_excel("C:\\Users\\Vaishob\\PycharmProjects\\pandas\\new.xlsx", sheet_name="stocks", startrow=1, startcol=2, index=False)
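
The mapping between the 0-based `startrow`/`startcol` offsets and Excel's 1-based cell addresses can be sketched with a small helper. `excel_cell` is a hypothetical function written for this illustration; it is not part of pandas.

```python
def excel_cell(startrow: int, startcol: int) -> str:
    """Return the A1-style address of the top-left cell for 0-based offsets."""
    letters = ""
    n = startcol + 1  # convert to a 1-based column number
    while n > 0:
        # Excel column letters behave like a base-26 system without a zero digit
        n, rem = divmod(n - 1, 26)
        letters = chr(ord("A") + rem) + letters
    return f"{letters}{startrow + 1}"

# startrow=1, startcol=2 (as in the call above) puts the header row at C2
print(excel_cell(1, 2))   # C2
# The defaults startrow=0, startcol=0 correspond to A1
print(excel_cell(0, 0))   # A1
print(excel_cell(0, 26))  # AA1
```

With `index=False`, the column headers land in that cell's row and the first data row follows directly below it.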

### Creating Multiple DataFrames

In this example:
- `df_stocks` holds stock market data with columns for stock tickers, prices, price-to-earnings ratios, and earnings per share.
- `df_weather` holds weather data with columns for the date, temperature, and weather event.

**Key features**:
- Creating multiple DataFrames allows you to manage and analyze different sets of data separately.
- DataFrames can hold diverse types of data (e.g., numerical, categorical, dates) and are ideal for tabular data analysis.

In [18]:
df_stocks = pd.DataFrame({
    'tickers': ['GOOGL', 'WMT', 'MSFT'],
    'price': [845, 65, 64],
    'pe': [30.37, 14.26, 30.97],
    'eps': [27.82, 4.61, 2.12]
})

df_weather = pd.DataFrame({
    'day': ['1/1/2017','1/2/2017','1/3/2017'],
    'temperature': [32,35,28],
    'event': ['Rain', 'Sunny', 'Snow']
})

### Writing Multiple DataFrames to an Excel File

Using `ExcelWriter()`, you can write multiple DataFrames to different sheets within the same Excel file. In this example:
- The `df_stocks` DataFrame is written to the "stocks" sheet.
- The `df_weather` DataFrame is written to the "weather" sheet.

**Key features**:
- Allows for writing multiple DataFrames to a single Excel file with different sheet names.
- The context manager (`with`) ensures that the file is saved and properly closed after the writing process.

In [20]:
# Opens an ExcelWriter object to write to the file "stocks_weather.xlsx"
with pd.ExcelWriter('C:\\Users\\Vaishob\\PycharmProjects\\pandas\\stocks_weather.xlsx') as writer:
    df_stocks.to_excel(writer, sheet_name="stocks") # Writes the df_stocks DataFrame to the "stocks" sheet
    df_weather.to_excel(writer, sheet_name="weather") # Writes the df_weather DataFrame to the "weather" sheet
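
One way to check the result is to read every sheet back with `sheet_name=None`, which returns a dictionary of DataFrames keyed by sheet name. The sketch below is an illustration under a few assumptions: it writes to an in-memory buffer instead of a file path, uses shortened copies of the two DataFrames, and requires the optional `openpyxl` package (the round trip is skipped if it is missing). `roundtrip_sheets` is a hypothetical helper name.

```python
import io
import pandas as pd

# Shortened copies of the tutorial's two DataFrames
df_stocks = pd.DataFrame({"tickers": ["GOOGL", "WMT", "MSFT"], "price": [845, 65, 64]})
df_weather = pd.DataFrame({"day": ["1/1/2017", "1/2/2017", "1/3/2017"], "temperature": [32, 35, 28]})

def roundtrip_sheets(frames):
    """Write each DataFrame to its own sheet in memory, then read every sheet back."""
    buffer = io.BytesIO()
    with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
        for sheet_name, frame in frames.items():
            frame.to_excel(writer, sheet_name=sheet_name, index=False)
    buffer.seek(0)
    # sheet_name=None returns a dict mapping sheet names to DataFrames
    return pd.read_excel(buffer, sheet_name=None, engine="openpyxl")

try:
    sheets = roundtrip_sheets({"stocks": df_stocks, "weather": df_weather})
    print(list(sheets))
except ImportError:
    # Assumption: openpyxl is the Excel engine; without it the demo is skipped
    sheets = None
    print("openpyxl is not installed; skipping the Excel round trip")
```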