<a href="https://colab.research.google.com/github/zia207/Python_for_Beginners/blob/main/Notebook/01_01_07_data_import_export_python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](http://drive.google.com/uc?export=view&id=1IFEWet-Aw4DhkkVe1xv_2YYqlvRe9m5_)

# 1.4 Data Import/Export in Python

This tutorial shows how to import and export data in Python across common file formats: plain text, CSV, Excel, JSON, and files from statistical software (Stata, SPSS, and SAS). Each format comes with a short code example, and the final section covers saving Python objects with `pickle`. By the end, you should be able to read these formats into a pandas DataFrame and write your results back out in your own projects.

## Introduction

One of the most important steps in data analysis is importing data into Python and exporting data from Python. This process can be done using various functions and libraries depending on the format of the data, such as CSV, Excel, or files from statistical software like SPSS or Stata. Mastering data import and export is a fundamental skill for any data scientist or analyst, as it allows you to work with real-world datasets and share your results effectively.

### Check and Install Required Python Packages

In this exercise, we will use the following Python libraries:

1.  **pandas**: The primary library for data manipulation and analysis. It can read and write CSV, Excel, and JSON files.
2.  **openpyxl**: A dependency for `pandas` to read and write modern Excel (`.xlsx`) files.
3.  **pyreadstat**: Reads data files from other statistical software (SPSS, Stata, and SAS) and writes SPSS, Stata, and SAS XPORT files. It plays the role of R's `haven` and `foreign` packages.
4.  **json**: A built-in Python module for handling JSON data.

In [None]:
# List of required packages
required_packages = [
    'pandas',
    'openpyxl',   # For Excel .xlsx files
    'pyreadstat'  # For SPSS, Stata, SAS files
    # 'json' is built-in, so no need to install
]

import importlib
import subprocess
import sys

def install_package(package):
    """Install a package using pip."""
    subprocess.check_call([sys.executable, "-m", "pip", "install", package])

# Check and install missing packages
for package in required_packages:
    try:
        importlib.import_module(package)
        print(f"'{package}' is already installed.")
    except ImportError:
        print(f"'{package}' is not installed. Installing now...")
        install_package(package)

print("\n All required packages are installed.")

'pandas' is already installed.
'openpyxl' is already installed.
'pyreadstat' is not installed. Installing now...
Defaulting to user installation because normal site-packages is not writeable
Collecting pyreadstat
  Using cached pyreadstat-1.3.1-cp310-cp310-manylinux2014_x86_64.manylinux_2_17_x86_64.whl (633 kB)
Collecting narwhals>=2.0
  Using cached narwhals-2.5.0-py3-none-any.whl (407 kB)
Installing collected packages: narwhals, pyreadstat
Successfully installed narwhals-2.5.0 pyreadstat-1.3.1

 All required packages are installed.
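
The check above tests for a package by importing it. As a lighter alternative, here is a minimal sketch that uses the standard library's `importlib.util.find_spec`, which looks a module up without importing it. The helper name `missing_modules` is hypothetical, not part of the tutorial. Keep in mind that a package's import name can differ from its pip name, so this check only works when the two match.

```python
import importlib.util

def missing_modules(names):
    """Return the module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# 'json' and 'os' ship with Python; the last name is deliberately fake.
print(missing_modules(["json", "os", "surely_not_a_real_module_xyz"]))
```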


### Import Python Libraries

Now that the packages are installed, we import them for use.

In [None]:
# Import the necessary libraries
import pandas as pd
import json
import pyreadstat
import os

print("Libraries successfully imported.")

Libraries successfully imported.


### Set Working Directory and Define Data Paths

Before we start, it's good practice to set or check your working directory. This is where Python will look for files to read and where it will save files by default.

In [None]:
# Check current working directory
print("Current Working Directory:")
print(os.getcwd())

# Define the data folder path
# Replace this with the path to your local data folder
data_folder = "/home/zia207/Dropbox/WebSites/Python_Website/Quarto_Projects/Python_for_Beginners/Data/"

# To make this your working directory, uncomment the line below
# (it raises FileNotFoundError if the folder does not exist):
# os.chdir(data_folder)

# List files in the data directory (optional)
try:
    print(f"\nFiles in '{data_folder}':")
    print(os.listdir(data_folder))
except FileNotFoundError:
    print(f"\n  The directory '{data_folder}' does not exist. Using GitHub URLs for data.")

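# For this tutorial, we will primarily use direct URLs from GitHub to ensure the code runs anywhere.
# Use the raw-file base URL: a ".../tree/main/Data/" page URL returns HTML, not the data file.
github_data_url = "https://github.com/zia207/Python_for_Beginners/raw/refs/heads/main/Data/"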



Current Working Directory:
/home/zia207/Dropbox/WebSites/GitHub_repository/python-websites/Python_for_Beginners/Notebook

Files in '/home/zia207/Dropbox/WebSites/Python_Website/Quarto_Projects/Python_for_Beginners/Data/':
['LBC_Data_PM25.csv', 'nepal_df_meta_data.csv', 'rice_data.dta', 'LBC_Data.csv', 'rice_data.sav', '03_poverty_2004_2016_both.csv', 'data_2004_2016_long_diabestes.csv', 'rice_data.rds', 'test_data.json', 'DT.csv', 'output.csv', 'data_2004.csv', 'diabetes_dignosed_2004_2016_total.csv', 'rice_data.xlsx', 'nepal_df_balance.csv', 'test_data.xpt', 'test_data.sav', 'nyc-taxi-tiny.zip', 'LBC_data.feather', 'taxi-zone-lookup.csv', '06_uninsured_2012_2016.csv', 'data_2009.csv', 'taxi_zone_lookup.csv', 'data_2005.csv', 'LBC_Data_ID.csv', 'data_2008.csv', 'test_data.txt', 'gp_soil_data.csv', 'rice_data.csv', 'test_data.sas7bdat', 'data_2001.csv', 'data_2002.csv', 'data_2006.csv', 'napal_data.feather', 'data_2007.csv', 'df.chem_02.csv', 'rice_data.RData', 'rice_data.json', '04_edu
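
Paths can also be built with the standard library's `pathlib` instead of `os.path`. The sketch below is only an illustration: the temporary folder and the file name `example.csv` are made up so that the code runs anywhere, and they stand in for a real data directory.

```python
import tempfile
from pathlib import Path

# A temporary folder stands in for a real data directory (assumption for this demo).
with tempfile.TemporaryDirectory() as tmp:
    data_dir = Path(tmp)
    (data_dir / "example.csv").write_text("a,b\n1,2\n")

    csv_path = data_dir / "example.csv"   # join path parts with "/"
    print(csv_path.name)                  # file name without the folder
    print(csv_path.exists())              # check before trying to read

    # List only the CSV files in the folder
    csv_files = sorted(p.name for p in data_dir.glob("*.csv"))
    print(csv_files)
```

`pandas.read_csv()` accepts a `Path` object directly, so `pd.read_csv(csv_path)` works the same way as passing a string.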

## Data Import Into Python

Data importing is the process of reading data from external files or databases into Python for analysis. Python, with its rich ecosystem of libraries, makes this process straightforward for a wide variety of file formats.

### Read Text File (.txt)

A text file (`.txt`) contains plain text. In data science, these are often delimited files (e.g., tab-separated or space-separated). We use `pandas.read_csv()` for this, specifying the appropriate delimiter.

In [None]:
# Read a .txt file (assuming it's tab or space-delimited)
# If it's space-delimited, use sep='\s+'
df_txt = pd.read_csv(
    os.path.join(data_folder, "test_data.txt"),
    sep='\t',  # Change this to ' ' or ',' as needed
    header=0   # Use the first row as column names
)

# Or read directly from GitHub
df_txt = pd.read_csv(
    "https://github.com/zia207/Python_for_Beginners/raw/refs/heads/main/Data/test_data.txt",
    sep='\t',  # Adjust delimiter
    header=0
)

print(df_txt.head())
print(df_txt.columns.tolist())

   ID   treat   var  rep     PH    TN    PN    GW  ster    DTM    SW    GAs  \
0   1  Low As  BR01    1   84.0  28.3  27.7  35.7  20.5  126.0  28.4  0.762   
1   2  Low As  BR01    2  111.7  34.0  30.0  58.1  14.8  119.0  36.7  0.722   
2   3  Low As  BR01    3  102.3  27.7  24.0  44.6   5.8  119.7  32.9  0.858   
3   4  Low As  BR06    1  118.0  23.3  19.7  46.4  20.3  119.0  40.0  1.053   
4   5  Low As  BR06    2  115.3  16.7  12.3  19.9  32.3  120.0  28.2  1.130   

    STAs  
0  14.60  
1  10.77  
2  12.69  
3  18.23  
4  13.72  
['ID', 'treat', 'var', 'rep', 'PH', 'TN', 'PN', 'GW', 'ster', 'DTM', 'SW', 'GAs', 'STAs']


To read files by name alone, set `data_folder` as the working directory first:

In [None]:
# Make data_folder the working directory, then read the file by name
os.chdir(data_folder)
df_txt = pd.read_csv("test_data.txt", sep='\t')
print(df_txt.head())

   ID   treat   var  rep     PH    TN    PN    GW  ster    DTM    SW    GAs  \
0   1  Low As  BR01    1   84.0  28.3  27.7  35.7  20.5  126.0  28.4  0.762   
1   2  Low As  BR01    2  111.7  34.0  30.0  58.1  14.8  119.0  36.7  0.722   
2   3  Low As  BR01    3  102.3  27.7  24.0  44.6   5.8  119.7  32.9  0.858   
3   4  Low As  BR06    1  118.0  23.3  19.7  46.4  20.3  119.0  40.0  1.053   
4   5  Low As  BR06    2  115.3  16.7  12.3  19.9  32.3  120.0  28.2  1.130   

    STAs  
0  14.60  
1  10.77  
2  12.69  
3  18.23  
4  13.72  
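
For files whose columns are separated by a varying number of spaces, the delimiter can be given as a regular expression. The sketch below uses a few made-up rows held in a string (via `io.StringIO`) so it runs without any file on disk. Note that this only works when no value contains a space; a value such as `Low As` would be split into two columns, which is one reason tab-delimited files are safer.

```python
import io
import pandas as pd

# Made-up rows whose columns are separated by a varying number of spaces.
raw_text = """ID  var   PH
1   BR01  84.0
2   BR01  111.7
3   BR06  118.0
"""

# sep=r"\s+" treats any run of whitespace as one delimiter
df_space = pd.read_csv(io.StringIO(raw_text), sep=r"\s+")
print(df_space)
print(df_space.dtypes)
```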


### Read Comma-Separated File (.csv)

A CSV (Comma-Separated Values) file is one of the most common formats for data exchange. The `pandas.read_csv()` function is designed for this.

In [None]:
# Read a .csv file
df_csv = pd.read_csv(
    os.path.join(data_folder, "test_data.csv")
)

# Or from GitHub
df_csv = pd.read_csv(
    "https://github.com/zia207/Python_for_Beginners/raw/refs/heads/main/Data/test_data.csv"
)
print(df_csv.head())

   ID   treat   var  rep     PH    TN    PN    GW  ster    DTM    SW    GAs  \
0   1  Low As  BR01    1   84.0  28.3  27.7  35.7  20.5  126.0  28.4  0.762   
1   2  Low As  BR01    2  111.7  34.0  30.0  58.1  14.8  119.0  36.7  0.722   
2   3  Low As  BR01    3  102.3  27.7  24.0  44.6   5.8  119.7  32.9  0.858   
3   4  Low As  BR06    1  118.0  23.3  19.7  46.4  20.3  119.0  40.0  1.053   
4   5  Low As  BR06    2  115.3  16.7  12.3  19.9  32.3  120.0  28.2  1.130   

    STAs  
0  14.60  
1  10.77  
2  12.69  
3  18.23  
4  13.72  
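
`read_csv()` has many optional parameters that are often useful with real data. The following is a small sketch on made-up rows (not the tutorial's dataset) showing three of them: `usecols` to load only some columns, `dtype` to set a column type, and `na_values` to declare an extra missing-value code. The code `-999` is an assumption chosen for the illustration.

```python
import io
import pandas as pd

# Made-up data: "NA" is a default missing marker, -999 is a custom one.
csv_text = """ID,treat,PH,GW
1,Low As,84.0,35.7
2,Low As,NA,58.1
3,High As,102.3,-999
"""

df_opts = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["ID", "PH", "GW"],   # load only these columns
    dtype={"ID": "int32"},        # store ID as a 32-bit integer
    na_values=["-999"],           # also treat -999 as missing
)
print(df_opts)
print(df_opts.isna().sum())
```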


### Read Excel Files (.xlsx, .xls)

To read Excel files, we use `pandas.read_excel()`. This function reads both the modern `.xlsx` format (through `openpyxl`) and the legacy `.xls` format (which requires the separate `xlrd` package).

In [None]:
# Read an Excel file from your local directory or from GitHub.
# Note: a URL must serve the raw workbook. The error recorded below came from a
# GitHub page URL (".../tree/main/Data/"), which returns HTML rather than the file.

# Read a local Excel file
df_xl = pd.read_excel(
    os.path.join(data_folder, "test_data.xlsx"),
    sheet_name=0  # Read the first sheet (0-indexed)
)


# Or read from GitHub

try:
    df_xl = pd.read_excel(
        github_data_url + "test_data.xlsx",
        sheet_name=0  # Read the first sheet
    )
    print("First few rows of the Excel file:")
    print(df_xl.head())
except Exception as e:
    print(f"Could not read Excel file from URL: {e}")
    print("Please download the file locally and adjust the path.")

Could not read Excel file from URL: Excel file format cannot be determined, you must specify an engine manually.
Please download the file locally and adjust the path.


### Read JSON Files (.json)

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write and easy for machines to parse and generate. In Python, we can read it using `pandas.read_json()` or the built-in `json` module.

In [None]:
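# Method 1: Using pandas (if the JSON is a simple array of objects)
# Note: the URL must serve raw JSON. The errors recorded below came from a GitHub
# page URL (".../tree/main/Data/"), which returns HTML that neither parser accepts.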
try:
    df_json = pd.read_json(
        github_data_url + "test_data.json"
    )
    print("First few rows of the JSON file (via pandas):")
    print(df_json.head())
except Exception as e:
    print(f"Could not read JSON file with pandas: {e}")

# Method 2: Using the json module (more flexible for complex structures)
import urllib.request

try:
    with urllib.request.urlopen(github_data_url + "test_data.json") as response:
        json_data = json.load(response)

    # Convert to DataFrame if it's a suitable structure
    df_json = pd.DataFrame(json_data)
    print("\nFirst few rows of the JSON file (via json module):")
    print(df_json.head())

except Exception as e:
    print(f"Could not read JSON file with json module: {e}")

Could not read JSON file with pandas: Expected object or value
Could not read JSON file with json module: Expecting value: line 7 column 1 (char 6)
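
When a JSON document is nested rather than a flat array of objects, a common approach is to parse it with the `json` module and flatten it with `pandas.json_normalize()`. The sketch below uses a made-up JSON string (the `traits` key and its values are assumptions for illustration) so it runs without network access.

```python
import json
import pandas as pd

# Made-up nested JSON: each record holds a nested "traits" object.
json_text = """
[
  {"ID": 1, "var": "BR01", "traits": {"PH": 84.0, "GW": 35.7}},
  {"ID": 2, "var": "BR06", "traits": {"PH": 118.0, "GW": 46.4}}
]
"""

records = json.loads(json_text)        # parse into a list of dictionaries
df_flat = pd.json_normalize(records)   # nested keys become dotted column names
print(df_flat.columns.tolist())
print(df_flat)
```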


### Import Data from Other Statistical Software

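The `pyreadstat` library reads data files from statistical software like SPSS, Stata, and SAS. Each read function returns a pandas DataFrame together with a metadata object, and the library can also write several of these formats. It is the direct counterpart to R's `haven` package.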

#### Read STATA File (.dta)

In [None]:
df_dta, meta_dta = pyreadstat.read_dta(
    os.path.join(data_folder, "test_data.dta")
)
print("First few rows of the STATA file:")
print(df_dta.head())


First few rows of the STATA file:
   ID   treat   var  rep     PH    TN    PN    GW  ster    DTM    SW    GAs  \
0   1  Low As  BR01    1   84.0  28.3  27.7  35.7  20.5  126.0  28.4  0.762   
1   2  Low As  BR01    2  111.7  34.0  30.0  58.1  14.8  119.0  36.7  0.722   
2   3  Low As  BR01    3  102.3  27.7  24.0  44.6   5.8  119.7  32.9  0.858   
3   4  Low As  BR06    1  118.0  23.3  19.7  46.4  20.3  119.0  40.0  1.053   
4   5  Low As  BR06    2  115.3  16.7  12.3  19.9  32.3  120.0  28.2  1.130   

    STAs  
0  14.60  
1  10.77  
2  12.69  
3  18.23  
4  13.72  


#### Read SPSS File (.sav)

In [None]:
# Read a .sav file
df_sav, meta_sav = pyreadstat.read_sav(
    os.path.join(data_folder, "test_data.sav")
)
print("First few rows of the SPSS file:")
print(df_sav.head())

First few rows of the SPSS file:
    ID   treat   var  rep     PH    TN    PN    GW  ster    DTM    SW    GAs  \
0  1.0  Low As  BR01  1.0   84.0  28.3  27.7  35.7  20.5  126.0  28.4  0.762   
1  2.0  Low As  BR01  2.0  111.7  34.0  30.0  58.1  14.8  119.0  36.7  0.722   
2  3.0  Low As  BR01  3.0  102.3  27.7  24.0  44.6   5.8  119.7  32.9  0.858   
3  4.0  Low As  BR06  1.0  118.0  23.3  19.7  46.4  20.3  119.0  40.0  1.053   
4  5.0  Low As  BR06  2.0  115.3  16.7  12.3  19.9  32.3  120.0  28.2  1.130   

    STAs  
0  14.60  
1  10.77  
2  12.69  
3  18.23  
4  13.72  


#### Read SAS File (.sas7bdat)

In [None]:
# Read a .sas7bdat file
df_sas, meta_sas = pyreadstat.read_sas7bdat(
    os.path.join(data_folder, "test_data.sas7bdat")
)

print("First few rows of the SAS file:")
print(df_sas.head())


First few rows of the SAS file:
    ID   treat   var  rep     PH    TN    PN    GW  ster    DTM    SW    GAs  \
0  1.0  Low As  BR01  1.0   84.0  28.3  27.7  35.7  20.5  126.0  28.4  0.762   
1  2.0  Low As  BR01  2.0  111.7  34.0  30.0  58.1  14.8  119.0  36.7  0.722   
2  3.0  Low As  BR01  3.0  102.3  27.7  24.0  44.6   5.8  119.7  32.9  0.858   
3  4.0  Low As  BR06  1.0  118.0  23.3  19.7  46.4  20.3  119.0  40.0  1.053   
4  5.0  Low As  BR06  2.0  115.3  16.7  12.3  19.9  32.3  120.0  28.2  1.130   

    STAs  
0  14.60  
1  10.77  
2  12.69  
3  18.23  
4  13.72  
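
pandas also ships its own readers for some of these formats: `pd.read_stata()`, `pd.read_sas()`, and `pd.read_spss()` (the last one relies on `pyreadstat` being installed). They return only a DataFrame, without a separate metadata object. The following is a minimal round-trip sketch for Stata using a small made-up DataFrame and a temporary folder, so it does not depend on the tutorial's data files.

```python
import os
import tempfile
import pandas as pd

# Small made-up numeric table to write and read back
df_small = pd.DataFrame({"rep": [1, 2, 3], "PH": [84.0, 111.7, 102.3]})

with tempfile.TemporaryDirectory() as tmp:
    dta_path = os.path.join(tmp, "small.dta")
    df_small.to_stata(dta_path, write_index=False)  # skip the row index
    df_back = pd.read_stata(dta_path)               # returns a DataFrame only

print(df_back)
```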


## Export Data from Python

Data exporting is the process of saving data from Python to a file format that can be used by other software or systems. This is crucial for sharing results and collaborating.

### Create a Sample Data Frame

First, let's create a simple data frame that we will use for exporting.

In [None]:
# Create sample data
rice_data = pd.DataFrame({
    'Variety': ["BR1","BR3", "BR16", "BR17", "BR18", "BR19","BR26", "BR27","BR28","BR29","BR35","BR36"],
    'Yield': [5.2,6.0,6.6,5.6,4.7,5.2,5.7, 5.9,5.3,6.8,6.2,5.8]
})

print("Sample Data Frame:")
print(rice_data)

Sample Data Frame:
   Variety  Yield
0      BR1    5.2
1      BR3    6.0
2     BR16    6.6
3     BR17    5.6
4     BR18    4.7
5     BR19    5.2
6     BR26    5.7
7     BR27    5.9
8     BR28    5.3
9     BR29    6.8
10    BR35    6.2
11    BR36    5.8


### Write as CSV File

The `to_csv()` method exports a DataFrame to a CSV file.

In [None]:
# Write to CSV
output_path_csv = os.path.join(data_folder, "rice_data.csv") if os.path.exists(data_folder) else "rice_data.csv"

rice_data.to_csv(
    output_path_csv,
    index=False  # Do not write row indices
)

print(f"Data exported to CSV: {output_path_csv}")

Data exported to CSV: /home/zia207/Dropbox/WebSites/Python_Website/Quarto_Projects/Python_for_Beginners/Data/rice_data.csv
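
`to_csv()` takes further options that control the output. As a hedged sketch on a made-up two-row table, the example below writes to an in-memory buffer with a semicolon delimiter, a fixed number of decimals, and an explicit marker for missing values, then reads the text back to confirm the round trip.

```python
import io
import pandas as pd

# Made-up table with one missing yield
df_out = pd.DataFrame({"Variety": ["BR1", "BR3"], "Yield": [5.2, None]})

buffer = io.StringIO()
df_out.to_csv(
    buffer,
    index=False,
    sep=";",              # semicolon-delimited output
    na_rep="NA",          # text written for missing values
    float_format="%.2f",  # two decimals for floats
)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back with the same delimiter; "NA" is parsed as missing by default
df_back = pd.read_csv(io.StringIO(csv_text), sep=";")
print(df_back)
```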


### Write as Excel File

The `to_excel()` method exports a DataFrame to an Excel file.

In [None]:
# Write to Excel
output_path_xlsx = os.path.join(data_folder, "rice_data.xlsx") if os.path.exists(data_folder) else "rice_data.xlsx"

rice_data.to_excel(
    output_path_xlsx,
    index=False,
    engine='openpyxl'
)

print(f"Data exported to Excel: {output_path_xlsx}")

Data exported to Excel: /home/zia207/Dropbox/WebSites/Python_Website/Quarto_Projects/Python_for_Beginners/Data/rice_data.xlsx
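
To place several DataFrames in one workbook, `pandas.ExcelWriter` can be used as a context manager. The sketch below is an illustration with made-up tables and sheet names; it writes to an in-memory buffer instead of a file and assumes `openpyxl` is installed. Reading back with `sheet_name=None` returns a dictionary that maps sheet names to DataFrames.

```python
import io
import pandas as pd

# Two made-up tables that will become two sheets
yields = pd.DataFrame({"Variety": ["BR1", "BR3"], "Yield": [5.2, 6.0]})
notes = pd.DataFrame({"Variety": ["BR1", "BR3"], "Note": ["check", "ok"]})

buffer = io.BytesIO()
with pd.ExcelWriter(buffer, engine="openpyxl") as writer:
    yields.to_excel(writer, sheet_name="yields", index=False)
    notes.to_excel(writer, sheet_name="notes", index=False)

# Rewind the buffer and read every sheet at once
buffer.seek(0)
sheets = pd.read_excel(buffer, sheet_name=None, engine="openpyxl")
print(list(sheets))
print(sheets["yields"])
```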


### Write as JSON File

The `to_json()` method exports a DataFrame to a JSON file.

In [None]:
# Write to JSON
output_path_json = os.path.join(data_folder, "rice_data.json") if os.path.exists(data_folder) else "rice_data.json"

rice_data.to_json(
    output_path_json,
    orient='records',  # Creates a list of dictionaries
    indent=4           # Pretty-print with indentation
)

print(f"Data exported to JSON: {output_path_json}")

Data exported to JSON: /home/zia207/Dropbox/WebSites/Python_Website/Quarto_Projects/Python_for_Beginners/Data/rice_data.json
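
The `orient` argument decides the shape of the JSON that `to_json()` produces. The following sketch compares three orientations on a made-up two-row table; calling `to_json()` without a path returns the JSON as a string, which is parsed here with `json.loads()` to make the structure visible.

```python
import json
import pandas as pd

df_small = pd.DataFrame({"Variety": ["BR1", "BR3"], "Yield": [5.2, 6.0]})

# One dictionary per row
as_records = json.loads(df_small.to_json(orient="records"))
# One dictionary per column, keyed by row label
as_columns = json.loads(df_small.to_json(orient="columns"))
# Column names, row labels, and values stored separately
as_split = json.loads(df_small.to_json(orient="split"))

print(as_records)
print(as_columns)
print(as_split)
```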


### Export to Other Statistical Software

Using `pyreadstat`, we can export our DataFrame to the formats used by Stata and SPSS. For SAS, `pyreadstat` writes the XPORT (`.xpt`) transport format rather than `.sas7bdat`.

#### Write STATA File (.dta)

In [None]:
output_path_dta = os.path.join(data_folder, "rice_data.dta") if os.path.exists(data_folder) else "rice_data.dta"

pyreadstat.write_dta(
    rice_data,
    output_path_dta
)

print(f"Data exported to STATA: {output_path_dta}")

Data exported to STATA: /home/zia207/Dropbox/WebSites/Python_Website/Quarto_Projects/Python_for_Beginners/Data/rice_data.dta


#### Write SPSS File (.sav)

In [None]:
output_path_sav = os.path.join(data_folder, "rice_data.sav") if os.path.exists(data_folder) else "rice_data.sav"

pyreadstat.write_sav(
    rice_data,
    output_path_sav
)

print(f"Data exported to SPSS: {output_path_sav}")

Data exported to SPSS: /home/zia207/Dropbox/WebSites/Python_Website/Quarto_Projects/Python_for_Beginners/Data/rice_data.sav


### Save Python Objects (Pickle)

To save Python objects (like DataFrames, lists, or dictionaries) for later use in Python, we use the `pickle` module. This is analogous to saving `.RData` or `.rds` files in R.

In [None]:
import pickle

# Save a single object (like R's saveRDS)
output_path_pkl = os.path.join(data_folder, "rice_data.pkl") if os.path.exists(data_folder) else "rice_data.pkl"

with open(output_path_pkl, 'wb') as file:
    pickle.dump(rice_data, file)

print(f"Single object saved as Pickle: {output_path_pkl}")

# To load it back:
# with open(output_path_pkl, 'rb') as file:
#     loaded_df = pickle.load(file)

# Save multiple objects (like R's save)
multi_objects = {
    'dataframe': rice_data,
    'variety_list': rice_data['Variety'].tolist(),
    'yield_list': rice_data['Yield'].tolist()
}

output_path_multi_pkl = os.path.join(data_folder, "multi_objects.pkl") if os.path.exists(data_folder) else "multi_objects.pkl"

with open(output_path_multi_pkl, 'wb') as file:
    pickle.dump(multi_objects, file)

print(f"Multiple objects saved as Pickle: {output_path_multi_pkl}")

Single object saved as Pickle: /home/zia207/Dropbox/WebSites/Python_Website/Quarto_Projects/Python_for_Beginners/Data/rice_data.pkl
Multiple objects saved as Pickle: /home/zia207/Dropbox/WebSites/Python_Website/Quarto_Projects/Python_for_Beginners/Data/multi_objects.pkl
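
Loading works the same way in reverse. The sketch below saves a small made-up bundle to a temporary folder, loads it again, and checks that the restored DataFrame matches the original. One caution that applies to any use of `pickle`: only load pickle files from sources you trust, because unpickling can execute arbitrary code.

```python
import os
import pickle
import tempfile
import pandas as pd

df_small = pd.DataFrame({"Variety": ["BR1", "BR3"], "Yield": [5.2, 6.0]})
bundle = {"dataframe": df_small, "variety_list": df_small["Variety"].tolist()}

with tempfile.TemporaryDirectory() as tmp:
    pkl_path = os.path.join(tmp, "bundle.pkl")

    # Save the bundle in binary mode
    with open(pkl_path, "wb") as file:
        pickle.dump(bundle, file)

    # Load it back in binary mode
    with open(pkl_path, "rb") as file:
        restored = pickle.load(file)

print(restored["dataframe"].equals(df_small))
print(restored["variety_list"])
```

For a single DataFrame, pandas offers the shortcuts `DataFrame.to_pickle()` and `pd.read_pickle()`, which wrap the same mechanism.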


## Summary and Conclusion

This guide covers the essential skills for importing and exporting data in Python. Key libraries like `pandas` and `pyreadstat` provide robust, easy-to-use functions for handling a wide array of file formats.

- **pandas** is your go-to for CSV, Excel, and JSON.
- **pyreadstat** is indispensable for working with SPSS, Stata, and SAS files.
- The built-in **json** and **pickle** modules offer fine-grained control for their respective formats.

By mastering these techniques, you can efficiently bring data into your Python environment for analysis and export your results for use in other applications or to share with colleagues.

## Resources

1.  [pandas Documentation](https://pandas.pydata.org/docs/)
2.  [pyreadstat Documentation](https://ofajardo.github.io/pyreadstat_documentation/_build/html/index.html)
3.  [Python json Module](https://docs.python.org/3/library/json.html)
4.  [Python pickle Module](https://docs.python.org/3/library/pickle.html)