# Mastering Pandas read_csv() with Examples – A Tutorial by CodesWithPankaj.com

Introduction:

Pandas, a powerful data manipulation library in Python, has become an essential tool for data scientists and analysts. One of its key functions is `read_csv()`, which allows users to read data from CSV (Comma-Separated Values) files into a Pandas DataFrame. In this tutorial, brought to you by CodesWithPankaj.com, we will explore the intricacies of `read_csv()` with clear examples to help you harness its full potential.

Understanding read_csv():

`read_csv()` is a versatile function with many parameters for handling different CSV layouts. If your dataset uses a delimiter other than a comma, contains missing values, or needs custom column names, parameters such as `sep`, `na_values`, and `names` cover those cases.
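As a quick illustration of those three situations, here is a minimal sketch that reads a small in-memory sample instead of a file. The sample rows, the `unknown` marker, and the variable names `raw` and `df_custom` are made up for this example and are not part of the employee dataset used later.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data with no header row;
# the string "unknown" stands in for a missing age
raw = io.StringIO(
    "Asha;29;Female\n"
    "Ravi;unknown;Male\n"
    "Meera;41;Female\n"
)

df_custom = pd.read_csv(
    raw,
    sep=";",                       # custom delimiter
    header=None,                   # the data has no header row
    names=["name", "age", "sex"],  # custom column names
    na_values=["unknown"],         # extra string to treat as missing
)

print(df_custom)

# Count the missing values per column
print(df_custom.isna().sum())
```

Because one age is missing, Pandas stores the `age` column as floating point and shows the gap as `NaN`.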

## Importing a CSV file using the read_csv() function

```python
# Importing the Pandas library
import pandas as pd

# Specify the file path or URL of the CSV file
file_path = 'DataSet/p4n_emp.csv'

# Use the read_csv() function to read the CSV file into a DataFrame
df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
print(df.head())
```

```text
                    name  age     sex  \
0         Aaron, Dayan M   38    Male   
1       Aaron, Freddie L   52    Male   
2        Aaron, Tyrone M   44    Male   
3        Abazenab, Kokeb   42  Female   
4  Abbott, Christopher D   32    Male   

                                       ethnic.origin  \
0  Black or African American (Not Hispanic or Lat...   
1  Black or African American (Not Hispanic or Lat...   
2  Black or African American (Not Hispanic or Lat...   
3  Black or African American (Not Hispanic or Lat...   
4                     White (Not Hispanic or Latino)   

                            job.title  \
0                    ATL311 Team Lead   
1  Environmental Service Worker I (D)   
2       Watershed Crew Supervisor (D)   
3         Benefits Representative, Sr   
4     Recreation Operations Assistant   

                                organization  annual.salary  
0                      EXE Executive Offices       45999.99  
1             DPW Department of Public Works 
```

## Setting a column as the index

```python
# Importing the Pandas library
import pandas as pd

# Specify the file path of the CSV file
file_path = 'DataSet/p4n_emp.csv'

# Use the read_csv() function to read the CSV file into a DataFrame and set "name" as the index
df = pd.read_csv(file_path, index_col='name')

# Display the first few rows of the DataFrame
print(df.head())
```

```text
                       age     sex  \
name                                 
Aaron, Dayan M          38    Male   
Aaron, Freddie L        52    Male   
Aaron, Tyrone M         44    Male   
Abazenab, Kokeb         42  Female   
Abbott, Christopher D   32    Male   

                                                           ethnic.origin  \
name                                                                       
Aaron, Dayan M         Black or African American (Not Hispanic or Lat...   
Aaron, Freddie L       Black or African American (Not Hispanic or Lat...   
Aaron, Tyrone M        Black or African American (Not Hispanic or Lat...   
Abazenab, Kokeb        Black or African American (Not Hispanic or Lat...   
Abbott, Christopher D                     White (Not Hispanic or Latino)   

                                                job.title  \
name                                                        
Aaron, Dayan M                           ATL311 Team Lead   
Aaron, Freddie L 
```
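Once a column is the index, rows can be looked up by label. The sketch below shows this on a tiny in-memory sample. The two rows and the names `csv_text`, `df_indexed`, and `df_flat` are hypothetical and only mimic the "Last, First" name format of the dataset above.

```python
import io

import pandas as pd

# Hypothetical sample; names are quoted because they contain commas
csv_text = io.StringIO(
    "name,age,sex\n"
    '"Rao, Asha",29,Female\n'
    '"Iyer, Ravi",35,Male\n'
)

df_indexed = pd.read_csv(csv_text, index_col="name")

# Label-based lookup of a single row through the index
row = df_indexed.loc["Rao, Asha"]
print(row["age"])

# reset_index() turns the index back into a regular column
df_flat = df_indexed.reset_index()
print(df_flat.columns.tolist())  # ['name', 'age', 'sex']
```

Setting the index at read time is equivalent to reading the file normally and then calling `df.set_index('name')`.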

## Selecting specific columns to read into memory

```python
# Importing the Pandas library
import pandas as pd

# Specify the file path of the CSV file
file_path = 'DataSet/p4n_emp.csv'

# Specify the columns you want to select
selected_columns = ['name', 'age', 'sex']

# Use the read_csv() function to read only the specified columns into a DataFrame
df = pd.read_csv(file_path, usecols=selected_columns)

# Display the first few rows of the DataFrame
print(df.head())
```

```text
                    name  age     sex
0         Aaron, Dayan M   38    Male
1       Aaron, Freddie L   52    Male
2        Aaron, Tyrone M   44    Male
3        Abazenab, Kokeb   42  Female
4  Abbott, Christopher D   32    Male
```
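`usecols` also accepts a function that decides, column by column, what to keep, and it combines well with `dtype` to control how the selected columns are stored. The following is a small sketch on made-up data. The sample rows and the names `csv_text`, `df_no_salary`, and `df_typed` are assumptions for illustration only.

```python
import io

import pandas as pd

# Hypothetical sample with the same style of column names as the dataset above
csv_text = (
    "name,age,sex,annual.salary\n"
    "Asha,29,Female,52000.0\n"
    "Ravi,35,Male,61000.5\n"
)

# A callable is evaluated against each column name; columns returning True are kept
df_no_salary = pd.read_csv(
    io.StringIO(csv_text),
    usecols=lambda col: col != "annual.salary",
)
print(df_no_salary.columns.tolist())  # ['name', 'age', 'sex']

# usecols combined with dtype: read two columns and store age as a 32-bit integer
df_typed = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["name", "age"],
    dtype={"age": "int32"},
)
print(df_typed.dtypes)
```

Reading fewer columns, and using smaller types where the values allow it, reduces memory use on large files.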


## DataFrame Methods

1. **head(n=5)**:
   - Displays the first n rows of the DataFrame.
   ```python
   df.head(10)  # Display the first 10 rows
   ```

2. **tail(n=5)**:
   - Displays the last n rows of the DataFrame.
   ```python
   df.tail(8)  # Display the last 8 rows
   ```

3. **info()**:
   - Prints a concise summary of the DataFrame, including each column's data type, its non-null count, and the memory usage.
   ```python
   df.info()
   ```

4. **describe()**:
   - Generates descriptive statistics of the DataFrame's numerical columns.
   ```python
   df.describe()
   ```

5. **unique()**:
   - Returns an array of the unique values in a column. It is a Series method, so select a single column first.
   ```python
   unique_values = df['column_name'].unique()
   ```

6. **value_counts()**:
   - Returns a Series containing counts of unique values in a specified column.
   ```python
   value_counts = df['column_name'].value_counts()
   ```

7. **sort_values()**:
   - Sorts the DataFrame by specified column(s).
   ```python
   df_sorted = df.sort_values(by='column_name')
   ```

8. **groupby()**:
   - Groups the DataFrame by specified column(s) for further aggregation. Passing `numeric_only=True` keeps `mean()` from raising an error on text columns in recent Pandas versions.
   ```python
   grouped_data = df.groupby('column_name').mean(numeric_only=True)
   ```
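To see several of these methods working together, here is a small sketch on a hand-made DataFrame. The `dept` and `salary` columns and all values are invented for the example and do not come from the employee dataset.

```python
import pandas as pd

# Hypothetical data for illustration
df_small = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera", "Kiran"],
    "dept": ["IT", "HR", "IT", "HR"],
    "salary": [50000, 42000, 58000, 46000],
})

# Distinct departments, in order of first appearance
departments = df_small["dept"].unique()
print(departments)

# Number of rows per department
dept_counts = df_small["dept"].value_counts()
print(dept_counts)

# Highest salary first
by_salary = df_small.sort_values(by="salary", ascending=False)
print(by_salary)

# Average salary per department; selecting the column avoids averaging text columns
mean_salary = df_small.groupby("dept")["salary"].mean()
print(mean_salary)
```

Selecting the column before aggregating (`['salary']`) is an alternative to `numeric_only=True` when only one numeric column is of interest.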

## DataFrame Attributes

1. **index**:
   - Returns the index (row labels) of the DataFrame.
   ```python
   print(df.index)
   ```

2. **values**:
   - Returns a two-dimensional NumPy array of the DataFrame's values. The Pandas documentation recommends `df.to_numpy()` for the same purpose.
   ```python
   print(df.values)
   ```

3. **dtypes**:
   - Returns a Series with the data type of each column.
   ```python
   print(df.dtypes)
   ```

4. **shape**:
   - Returns a tuple representing the dimensions (rows, columns) of the DataFrame.
   ```python
   print(df.shape)
   ```

5. **columns**:
   - Returns an Index object containing the column labels.
   ```python
   print(df.columns)
   ```

6. **size**:
   - Returns the number of elements in the DataFrame (rows * columns).
   ```python
   print(df.size)
   ```
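Attributes are read without parentheses, unlike methods. The short sketch below checks how `shape`, `size`, and `columns` relate to each other on a made-up DataFrame. The data and the name `df_attrs` are assumptions for this example.

```python
import pandas as pd

# Hypothetical data: 3 rows, 2 columns
df_attrs = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera"],
    "age": [29, 35, 41],
})

# shape is a (rows, columns) tuple, so it can be unpacked
rows, cols = df_attrs.shape
print(rows, cols)  # 3 2

# size is the total number of cells: rows * columns
print(df_attrs.size)  # 6

# columns is an Index; tolist() gives a plain Python list
print(df_attrs.columns.tolist())  # ['name', 'age']

# dtypes shows how each column is stored
print(df_attrs.dtypes)
```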

## Exporting the DataFrame to a CSV File

```python
# Assuming df is your DataFrame
import pandas as pd

# Specify the file path for the CSV output
output_file_path = 'output_data.csv'

# Use the to_csv() method to export the DataFrame to a CSV file
df.to_csv(output_file_path, index=False)

# Display a message indicating the successful export
print(f"DataFrame exported to {output_file_path}")
```

```text
DataFrame exported to output_data.csv
```
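The `index=False` argument matters: without it, `to_csv()` writes the row index as an extra unnamed first column. The sketch below makes the difference visible without touching the disk, relying on the fact that `to_csv()` returns the CSV text as a string when no path is given. The data and variable names are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical data for illustration
df_out = pd.DataFrame({"name": ["Asha", "Ravi"], "age": [29, 35]})

# With no path argument, to_csv() returns the CSV text as a string
with_index = df_out.to_csv()
without_index = df_out.to_csv(index=False)

# Compare the header lines of both versions
print(with_index.splitlines()[0])     # ,name,age
print(without_index.splitlines()[0])  # name,age

# Reading the index-free text back gives the original columns
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip.columns.tolist())  # ['name', 'age']
```

If a file written with the default `index=True` is read back, the old index appears as a column named `Unnamed: 0` unless `index_col=0` is passed to `read_csv()`.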


The next example combines the steps covered so far: loading the file, filtering rows, grouping, computing summary statistics, and exporting the filtered result.

```python
# Importing the Pandas library
import pandas as pd

# Specify the file path for the CSV input
file_path = 'DataSet/p4n_emp.csv'

# Load the dataset into a DataFrame
df = pd.read_csv(file_path)

# Display the first few rows of the DataFrame
print("Original DataFrame:")
print(df.head())

# Filtering: Selecting employees aged 30 or younger
young_employees = df[df['age'] <= 30]

# Display the first few rows of the filtered DataFrame
print("\nYoung Employees:")
print(young_employees.head())

# Grouping: Calculate the average salary by job title
average_salary_by_job = df.groupby('job.title')['annual.salary'].mean().reset_index()

# Display the average salary by job title
print("\nAverage Salary by Job Title:")
print(average_salary_by_job)

# Summary Statistics: Displaying overall summary statistics of the dataset
summary_statistics = df.describe(include='all')

# Display the summary statistics
print("\nSummary Statistics:")
print(summary_statistics)

# Export the filtered DataFrame to a new CSV file
output_filtered_path = 'DataSet/young_employees.csv'
young_employees.to_csv(output_filtered_path, index=False)
print(f"\nFiltered DataFrame exported to {output_filtered_path}")
```

```text
Original DataFrame:
                    name  age     sex  \
0         Aaron, Dayan M   38    Male   
1       Aaron, Freddie L   52    Male   
2        Aaron, Tyrone M   44    Male   
3        Abazenab, Kokeb   42  Female   
4  Abbott, Christopher D   32    Male   

                                       ethnic.origin  \
0  Black or African American (Not Hispanic or Lat...   
1  Black or African American (Not Hispanic or Lat...   
2  Black or African American (Not Hispanic or Lat...   
3  Black or African American (Not Hispanic or Lat...   
4                     White (Not Hispanic or Latino)   

                            job.title  \
0                    ATL311 Team Lead   
1  Environmental Service Worker I (D)   
2       Watershed Crew Supervisor (D)   
3         Benefits Representative, Sr   
4     Recreation Operations Assistant   

                                organization  annual.salary  
0                      EXE Executive Offices       45999.99  
1             DPW Departm
```
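Filters can also combine several conditions. The sketch below shows the pattern on a small made-up DataFrame that reuses the `age`, `sex`, and `annual.salary` column names from the dataset. The rows themselves and the name `df_emp` are hypothetical.

```python
import pandas as pd

# Hypothetical rows that reuse the dataset's column names
df_emp = pd.DataFrame({
    "name": ["Asha", "Ravi", "Meera", "Kiran"],
    "age": [28, 45, 30, 52],
    "sex": ["Female", "Male", "Female", "Male"],
    "annual.salary": [52000.0, 61000.0, 48000.0, 75000.0],
})

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
young_women = df_emp[(df_emp["age"] <= 30) & (df_emp["sex"] == "Female")]
print(young_women["name"].tolist())  # ['Asha', 'Meera']

# Column names containing a dot must be accessed with brackets, not attribute syntax
salary_by_sex = df_emp.groupby("sex")["annual.salary"].mean()
print(salary_by_sex)
```

Python's `and` and `or` keywords do not work element-wise on Series, which is why the `&` and `|` operators are used here.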

## Conclusion

The `read_csv()` function in Pandas is a versatile tool for importing data from CSV files. In this tutorial, we covered basic usage, setting an index column with `index_col`, reading only selected columns with `usecols`, inspecting the result with common DataFrame methods and attributes, and exporting data with `to_csv()`. The same function also accepts parameters for custom delimiters, missing-value markers, and custom column names, as noted at the start. Armed with this knowledge, you'll be better equipped to tackle diverse datasets in your data science work.

Visit CodesWithPankaj.com for more in-depth tutorials and coding insights to enhance your Python and Pandas skills. Happy coding!