# Data I/O Using Python

## Overview
Data import and export are processes that involve transferring data between different systems or formats. Data import is the process of bringing data into a data analysis tool or platform from an external source, such as a database, file, or web service. Data export, on the other hand, involves transferring data from the analysis tool to another destination, such as a file, database, or web service.

Data import and export play a critical role in data wrangling because they enable data analysts to access and integrate data from various sources and prepare it for analysis. By importing data into a data analysis tool, analysts can combine data from different sources, perform transformations, and clean the data to remove inconsistencies, errors, and missing values.

In addition, data export enables analysts to share their results with other stakeholders or external systems. By exporting data in a suitable format, analysts can create reports, dashboards, or visualizations that convey their findings effectively. Data export also allows analysts to integrate their results with other tools or systems, such as business intelligence platforms, data warehouses, or machine learning models.

Effective data import and export require a good understanding of the data formats, data sources, and data destinations involved. It is crucial to choose the appropriate import and export methods, considering factors such as data size, structure, complexity, and security requirements.

In short, data import and export are critical components of data wrangling that facilitate the integration, preparation, and sharing of data for analysis. By mastering these processes, you can streamline workflows, reduce errors, and produce more accurate and actionable insights.

In this module, we will cover the following topics:

I. Importing data: Extracting data from different types of files or sources.
II. Exporting data: Writing data to different types of files.


## Learning Objectives
In this module, the learners will:

* Understand the importance of data import and export in data analysis
* Recall the different file types that can be used to import and export data
* Apply data import and export techniques to different file types
* Assess the appropriateness of various data import and export techniques for specific data analysis tasks

Let's get started!


## Dataset
Titanic dataset: This is a well-known and widely used dataset in the field of data analysis and machine learning. This dataset contains information about the passengers on the Titanic ship, including their demographic information, ticket information, and survival status. In this module, we use the Titanic dataset to practice importing data from, and exporting data to, different file formats.

Here's a description of the columns in the dataset:

* PassengerId: This column is a unique identifier assigned to each passenger.
* Age: This column specifies the age of the passenger.
* Name: This column specifies the name of the passenger.
* Sex: This column specifies the gender of the passenger (male or female).
* Survived: This column specifies whether the passenger survived the Titanic disaster or not. The values in this column can either be 0 (did not survive) or 1 (survived).
* Pclass (Passenger Class): This column specifies the class of the passenger (1st, 2nd, or 3rd class).
* SibSp (Siblings/Spouses Aboard): This column specifies the number of siblings or spouses the passenger was traveling with.
* Parch (Parents/Children Aboard): This column specifies the number of parents or children the passenger was traveling with.
* Ticket: This column specifies the ticket number assigned to the passenger.
* Fare: This column specifies the fare paid by the passenger for their ticket.
* Cabin: This column specifies the cabin number assigned to the passenger.
* Embarked: This column specifies the port where the passenger boarded the Titanic (C = Cherbourg; Q = Queenstown; S = Southampton).

# Importing Data

## What is importing data?
Importing data refers to the process of loading data from external sources into a program or software for further analysis. Importing data allows us to access data from various sources such as CSV files, Excel spreadsheets, APIs, and web pages.

## Why is it important?
In the context of data wrangling, importing data is a crucial step as it is the first step in the process of data cleaning and preparation. Once we have imported the data, we can start analyzing it to gain insights, build models, or create visualizations.

It is important to import data accurately and completely, as any errors or missing data can lead to inaccurate analysis and results. Therefore, it is crucial to perform data quality checks during the import process to confirm that each column has the expected format and type, and to identify missing values and errors before the analysis starts.
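The checks described above can be sketched in a few lines of pandas. This is a minimal illustration rather than a complete validation routine: the sample rows are made up for the example and are read from an in-memory string instead of the course file.

```python
import io

import pandas as pd

# A tiny in-memory CSV standing in for an imported file (hypothetical sample rows)
csv_text = """PassengerId,Survived,Age,Cabin
1,0,22.0,
2,1,38.0,C85
3,1,,
"""

df = pd.read_csv(io.StringIO(csv_text))

# Check that each column was read with the expected type
print(df.dtypes)

# Count the missing values in each column
missing_counts = df.isna().sum()
print(missing_counts)

# Confirm that the identifier column contains no duplicates
has_duplicate_ids = df["PassengerId"].duplicated().any()
print("Duplicate ids:", has_duplicate_ids)
```

Running similar checks right after every import makes it easier to spot a column that was read as text instead of numbers, or a field that is mostly empty.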

## Import from Excel
Importing data from Excel is a common approach. Excel is a popular spreadsheet application that allows users to store, organize, and analyze data in tabular format. Importing data from Excel into a data analysis tool or platform can enable you to access and analyze data more efficiently and accurately.

Here is an example code snippet of importing data from an Excel file into a Pandas DataFrame in Python:

In [1]:
import pandas as pd

# Load the Excel file into a pandas DataFrame using the read_excel() function
df = pd.read_excel('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/TitanicPassengerRawDataset.xlsx')

# Print the first five rows of the DataFrame
print(df.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


In this code, we use the 'read_excel()' function to load the Excel file 'TitanicPassengerRawDataset.xlsx' into a Pandas DataFrame. By default, the function reads the first sheet of the workbook. We can use the 'sheet_name' parameter to specify the name or position of the sheet we want to load.

### CAVEAT
Importing data from Excel can introduce issues such as improper data formatting, type conversion errors, and data integrity problems due to easy user modification. Careful attention should be paid to ensure that the imported data is complete, accurate, and properly formatted for analysis.

## Import from CSV
Importing data from CSV is also a common approach. CSV stands for "Comma-Separated Values" and is a file format used to store and exchange data in a plain-text format. Importing data from CSV files into a data analysis tool or platform can enable you to access and analyze data more efficiently and accurately.

Here is an example code snippet of importing data from a CSV file into a Pandas DataFrame:

In [2]:
import pandas as pd

# Load the CSV file into a pandas DataFrame using the read_csv() function
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# Print the first five rows of the DataFrame
print(df.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  


In this code, we use the 'read_csv()' function to load the CSV file 'titanic.csv' into a Pandas DataFrame. By default, the function assumes that the values are separated by commas. If a file uses a different delimiter, such as a semicolon or a tab, it can be specified with the 'sep' parameter.
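Not every delimited file uses commas. The sketch below shows two ways of reading a semicolon-separated file with 'read_csv()': naming the delimiter with 'sep', or passing 'sep=None' together with the Python parsing engine so that pandas tries to infer the delimiter. The sample text is hypothetical and is read from memory rather than from the course file.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited sample (not the course file)
text = "PassengerId;Survived;Pclass\n1;0;3\n2;1;1\n3;1;3\n"

# Option 1: state the delimiter explicitly
df_explicit = pd.read_csv(io.StringIO(text), sep=";")

# Option 2: let the Python engine infer the delimiter from the file
df_sniffed = pd.read_csv(io.StringIO(text), sep=None, engine="python")

print(df_explicit.columns.tolist())
print(df_sniffed.columns.tolist())
```

Stating the delimiter explicitly is usually the safer choice, because inference can be misled when the data itself contains commas or other candidate separators.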

## Import from JSON
JSON stands for "JavaScript Object Notation" and is a lightweight data-interchange format that is easy for humans to read and write, and easy for machines to parse and generate. Importing data from JSON files into a data analysis tool or platform can enable you to access and analyze data more efficiently and accurately.

Here is an example code snippet of importing data from a JSON file into a Pandas DataFrame:

In [3]:
import pandas as pd

# Load the JSON file into a pandas DataFrame using the read_json() function
df = pd.read_json('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.json')

# Print the first five rows of the DataFrame
print(df.head())

   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500  None        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250  None        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500  None        S  


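In this code, we use the 'read_json()' function to load the JSON file 'titanic.json' into a Pandas DataFrame. The function parses the JSON text and builds a DataFrame from it. If a file uses a layout different from the one pandas expects, the 'orient' parameter can be used to describe that layout.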
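To make the role of the JSON layout more concrete, here is a small sketch that reads a list of JSON objects (the "records" layout) from an in-memory string. The two records are a hypothetical sample, and 'orient="records"' is passed only to make the expected layout explicit.

```python
import io

import pandas as pd

# Hypothetical two-record sample in the "records" layout (a list of objects)
json_text = (
    '[{"PassengerId": 1, "Survived": 0, "Cabin": null},'
    ' {"PassengerId": 2, "Survived": 1, "Cabin": "C85"}]'
)

# Each JSON object becomes one row, and each key becomes a column
df = pd.read_json(io.StringIO(json_text), orient="records")
print(df)
```

A JSON 'null' is read as a missing value, which matches the 'None' entries shown in the 'Cabin' column of the output above.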

## Import from XML
XML (Extensible Markup Language) is a popular markup language used to store and exchange data over the internet. Importing data from XML files is a common data-wrangling task that involves reading the data from an XML file and converting it into a format that can be used for analysis.

To import data from an XML file, you first need to parse the XML document using an appropriate library. In Python, a commonly used option is the ElementTree module ('xml.etree.ElementTree'), which is part of the standard library.

Here is an example code that demonstrates how to import data from an XML file:

In [4]:
import xml.etree.ElementTree as ET
import urllib.request

# specify the URL
url = 'https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.xml'

# specify the local file path to save the downloaded XML file
file_path = 'titanic.xml'

# download the XML file
urllib.request.urlretrieve(url, file_path)

# parse the XML document
tree = ET.parse(file_path)
root = tree.getroot()

# display the entire XML document as a string
xml_string = ET.tostring(root)
print(xml_string)

 

b'<data><record><PassengerId>1</PassengerId><Survived>0</Survived><Pclass>3</Pclass><Name>Braund, Mr. Owen Harris</Name><Sex>male</Sex><Age>22.0</Age><SibSp>1</SibSp><Parch>0</Parch><Ticket>A/5 21171</Ticket><Fare>7.25</Fare><Cabin>nan</Cabin><Embarked>S</Embarked></record><record><PassengerId>2</PassengerId><Survived>1</Survived><Pclass>1</Pclass><Name>Cumings, Mrs. John Bradley (Florence Briggs Thayer)</Name><Sex>female</Sex><Age>38.0</Age><SibSp>1</SibSp><Parch>0</Parch><Ticket>PC 17599</Ticket><Fare>71.2833</Fare><Cabin>C85</Cabin><Embarked>C</Embarked></record><record><PassengerId>3</PassengerId><Survived>1</Survived><Pclass>3</Pclass><Name>Heikkinen, Miss. Laina</Name><Sex>female</Sex><Age>26.0</Age><SibSp>0</SibSp><Parch>0</Parch><Ticket>STON/O2. 3101282</Ticket><Fare>7.925</Fare><Cabin>nan</Cabin><Embarked>S</Embarked></record><record><PassengerId>4</PassengerId><Survived>1</Survived><Pclass>1</Pclass><Name>Futrelle, Mrs. Jacques Heath (Lily May Peel)</Name><Sex>female</Sex><Ag

In this code, we first import the ElementTree library and 'urllib.request', and then download the XML file from the given URL to the local path 'titanic.xml'. We then parse the XML document using the 'ET.parse()' function and store the root element of the tree in the 'root' variable.

Then we use the 'ET.tostring()' function to convert the root element to a string and store the result in the variable 'xml_string'. Finally, we print the string representation of the XML document. Because 'ET.tostring()' returns a bytes object by default, the printed output starts with the prefix b.
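Printing the whole document is useful for a first look, but for analysis the records usually need to be turned into rows. The following sketch assumes the same 'data' and 'record' structure that appears in the output above and uses a shortened, in-memory sample with only three fields per record.

```python
import xml.etree.ElementTree as ET

# Shortened sample that mirrors the <data>/<record> structure of the file
xml_text = (
    "<data>"
    "<record><PassengerId>1</PassengerId>"
    "<Name>Braund, Mr. Owen Harris</Name><Age>22.0</Age></record>"
    "<record><PassengerId>3</PassengerId>"
    "<Name>Heikkinen, Miss. Laina</Name><Age>26.0</Age></record>"
    "</data>"
)

root = ET.fromstring(xml_text)

# Turn every <record> element into a dictionary of {tag: text}
rows = [
    {field.tag: field.text for field in record}
    for record in root.findall("record")
]

# XML stores everything as text, so numeric fields need explicit conversion
ages = [float(row["Age"]) for row in rows]

for row in rows:
    print(row)
print(ages)
```

A list of dictionaries like 'rows' can be passed directly to 'pd.DataFrame()' if a DataFrame is needed for further work.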

## Import from HTML
Importing data from HTML can be useful in cases where the data is available on a website or a webpage, but not in a downloadable format such as CSV or Excel. To get the data, we can use web scraping techniques: we download the HTML content of the page and then parse it to pull out the relevant data.

Suppose you need to extract the names of the blogs from a particular webpage for analysis. To do this, you will need to use web scraping techniques to extract the information from the webpage.

One way to extract data from HTML is to use a Python library called Beautiful Soup. Beautiful Soup is a popular library for web scraping and HTML parsing. It provides a simple and intuitive API for navigating and searching HTML documents.

Here is an example code that demonstrates how to import data from HTML using BeautifulSoup:

In [5]:
import requests
from bs4 import BeautifulSoup

# send a request to the webpage
url = "https://datasciencedojo.com/pyds-webscraping/"
response = requests.get(url)

# parse the HTML content using Beautiful Soup
soup = BeautifulSoup(response.content, "html.parser")

# find all the blog names using the tag and class
blog_names = soup.find_all('h2', class_='elementor-heading-title elementor-size-default')

# print the names of the blogs
for blog in blog_names:
    print(blog.text.strip())

DISCOVER MORE OF WHAT MATTERS TO YOU


### NOTE
Using the 'h2' tag with the given class may fetch other information present in the HTML with similar tags and classes.

This code is an example of web scraping using Python's requests and BeautifulSoup libraries. It sends a request to the given URL and retrieves the HTML content of the webpage. Then, it uses BeautifulSoup to parse the HTML content and extract the required information.

In this case, the code finds all the blog names on the webpage using the 'find_all()' method with the specified tag and class and then prints the names of the blogs. The 'strip()' method is used to remove any leading or trailing whitespace from the extracted text.
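The idea of filtering headings by tag and class can also be tried offline. The sketch below deliberately swaps Beautiful Soup for the standard library's 'html.parser' module so that it runs without a network connection or extra packages. The HTML snippet, the class names, and the 'HeadingCollector' helper are all made up for this illustration.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect the text of <h2> elements that carry a given CSS class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.inside_heading = False
        self.headings = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated class names
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and self.wanted_class in classes:
            self.inside_heading = True
            self.headings.append("")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.inside_heading = False

    def handle_data(self, data):
        # Text may arrive in several pieces, so append to the current heading
        if self.inside_heading:
            self.headings[-1] += data


# Hypothetical page fragment with two matching headings and one non-matching
html_text = """
<html><body>
  <h2 class="post-title featured">  First blog post </h2>
  <h2 class="sidebar-title">Newsletter</h2>
  <h2 class="post-title">Second blog post</h2>
</body></html>
"""

parser = HeadingCollector("post-title")
parser.feed(html_text)
headings = [text.strip() for text in parser.headings]
print(headings)
```

The heading with the class 'sidebar-title' is skipped, which shows why choosing a class that is specific to the items of interest matters when scraping.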



# Exporting Data

## What is exporting data?
Exporting data refers to the process of saving or writing data from a software application to a file in a specific format. In data wrangling, exporting data is essential to share or use the processed data for analysis, reporting, or further processing. Exported data can be in various formats, including CSV, JSON, Excel, or any other file formats, depending on the requirements of the downstream applications.

## Why is it important?
Exporting data enables data analysts and data scientists to pass processed data on to the tools where analysis, modeling, and reporting take place. Exporting data also helps in data governance, where data needs to be shared across different teams or departments while ensuring data privacy and security. It also helps in data migration, where data needs to be moved from one system to another while ensuring data quality and consistency.

Overall, exporting data is an essential part of data wrangling, as it enables us to make use of the processed data, communicate the findings, and derive insights from the data.

## Export to JSON
Exporting data to JSON is a popular approach for sharing and transferring data between systems. In this context, we can export the Titanic dataset in JSON format.

JSON (JavaScript Object Notation) is a lightweight and widely used format for data interchange. It is easy to read and write for humans and machines alike, making it a popular choice for web-based applications.

To export the Titanic dataset to JSON, we can use the following code:

In [6]:
import pandas as pd

# Load the Titanic dataset into a pandas dataframe
titanic_df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# Export the dataframe to JSON
titanic_df.to_json('titanic.json', orient='records')
print("Successfully Exported")

Successfully Exported


### NOTE
JSON files exported from Jupyter Notebook may not be directly accessible from this platform. To access the file, you will need to run the export code on your local system and save the JSON file to a location where you can easily access it. Once saved, you can then open the file in your preferred JSON viewer or editor.

In the code above, we first load the Titanic dataset into a Pandas dataframe using the 'read_csv()' function. Then, we use the 'to_json()' method to export the dataframe to a JSON file named 'titanic.json'. The 'orient' parameter is set to 'records' to ensure that each row in the dataframe is represented as a separate JSON object.

Once the code is executed, a new file named 'titanic.json' will be created in the current working directory, which is usually the folder that contains the notebook or script. This file will contain the entire Titanic dataset in JSON format.

Exporting data to JSON can be useful in various scenarios such as sharing data between different programming languages or applications, or transferring data over the internet. JSON format is also compatible with many NoSQL databases and can be used for storing data in a more flexible and scalable manner.
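The effect of the 'orient' parameter is easier to see on a small example. The sketch below uses a hypothetical two-row DataFrame and calls 'to_json()' without a file path, in which case the method returns the JSON text as a string instead of writing a file.

```python
import json

import pandas as pd

# Hypothetical two-row sample
sample = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})

# 'records': a list with one JSON object per row
records_json = sample.to_json(orient="records")

# 'columns': one JSON object per column, keyed by the row index
columns_json = sample.to_json(orient="columns")

print(json.loads(records_json))
print(json.loads(columns_json))
```

The 'records' layout is usually the easiest one for other applications to consume, because each row is a self-contained object.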

## Export to XML
Exporting data to XML is another way of storing data in a structured format. XML stands for Extensible Markup Language, and it uses tags to define data elements and attributes. In this case, we can use it to export the Titanic dataset to an XML file.

To export data to XML, we first need to decide on the structure of the output file, for example one 'record' element per row inside a single root element. We can then use the ElementTree module in Python to build that structure and write it to a file.

Here's an example of how to export data to an XML file:

In [7]:
import pandas as pd
import xml.etree.ElementTree as ET

# Load the Titanic dataset into a pandas DataFrame
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# Convert the DataFrame to an ElementTree object
root = ET.Element('data')
for i, row in df.iterrows():
    record = ET.SubElement(root, 'record')
    for column in df.columns:
        value = str(row[column])
        ET.SubElement(record, column).text = value

# Write the ElementTree object to an XML file
tree = ET.ElementTree(root)
tree.write('the_titanic.xml')

print("Successfully Exported")

Successfully Exported


### NOTE
XML files exported from Jupyter Notebook may not be directly accessible from this platform. To access the file, you will need to run the export code on your local system and save the XML file to a location where you can easily access it. Once saved, you can then open the file in your preferred XML viewer or editor.

In the example above, we first load the Titanic dataset into a Pandas DataFrame using the 'read_csv()' function. We then create an ElementTree object by iterating over each row in the DataFrame and creating a new 'record' element for each row. For each column in the DataFrame, we create a new element with the column name and set its text to the corresponding value from the DataFrame.

Finally, we write the ElementTree object to an XML file using the 'write()' method. The resulting file will contain a top-level 'data' element with one child 'record' element for each row in the DataFrame. Each 'record' element in turn contains one child element for each column in the DataFrame.
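One detail of this approach is that missing values are converted with 'str()' and therefore end up in the file as the text 'nan'. The sketch below is one possible variation, not the only way to handle this: it wraps the same loop in a hypothetical helper named 'dataframe_to_xml' that leaves the element empty when a value is missing. It assumes that the column names are valid XML tag names.

```python
import xml.etree.ElementTree as ET

import pandas as pd


def dataframe_to_xml(df, root_tag="data", row_tag="record"):
    """Return the DataFrame as XML text, with empty elements for missing values."""
    root = ET.Element(root_tag)
    for row in df.itertuples(index=False):
        record = ET.SubElement(root, row_tag)
        for column, value in zip(df.columns, row):
            child = ET.SubElement(record, column)
            # Leave the element empty instead of writing the text 'nan'
            if not pd.isna(value):
                child.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical two-row sample with one missing cabin
sample = pd.DataFrame({"PassengerId": [1, 2], "Cabin": [None, "C85"]})
xml_text = dataframe_to_xml(sample)
print(xml_text)
```

Whether an empty element, an omitted element, or a placeholder text is the right representation depends on what the system reading the XML file expects.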

## Export to HTML
Exporting data to HTML involves converting a Pandas DataFrame into an HTML table that can be displayed in a web browser. This is useful for sharing data with others or for embedding a table in a web page or report.

Here is an example code snippet for exporting a Pandas DataFrame to an HTML file:

In [8]:
import pandas as pd

# Load the Titanic dataset
titanic_df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# Export the DataFrame to an HTML file
titanic_df.to_html('titanic.html')

print("Successfully Exported")

Successfully Exported


### NOTE
HTML files exported from Jupyter Notebook may not be directly accessible from this platform. To access the file, you will need to run the export code on your local system and save the HTML file to a location where you can easily access it. Once saved, you can then open the file in your preferred HTML viewer or editor.

In this example, we first load the Titanic dataset into a Pandas DataFrame called 'titanic_df'. We then use the 'to_html()' method to export the DataFrame to an HTML file called 'titanic.html'. By default, this method will generate an HTML table that displays the contents of the DataFrame.

The resulting 'titanic.html' file can be opened in a web browser to view the table. The table will include column headers and row indices by default, but these can be customized using various parameters of the 'to_html()' method.
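As a brief illustration of such customization, the sketch below hides the row index, selects the columns to include, and formats floating-point values with two decimals. The two-row DataFrame is a hypothetical sample, and 'to_html()' is called without a file name so that it returns the HTML as a string.

```python
import pandas as pd

# Hypothetical two-row sample
sample = pd.DataFrame({"PassengerId": [1, 2], "Fare": [7.25, 71.2833]})

html_table = sample.to_html(
    index=False,                      # leave out the row index
    columns=["PassengerId", "Fare"],  # choose which columns to export
    float_format="{:.2f}".format,     # show floats with two decimals
)
print(html_table)
```

Passing a file name as the first argument, as in the example above, writes the same table to disk instead of returning it.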

## Export to Excel
Exporting data to Excel is a common task in data analysis and visualization. In this task, we will export the Titanic dataset to an Excel file using Python. We can use the 'Workbook' class from the xlsxwriter library to create the file and write the data into it.

Here is an example code snippet that demonstrates how to export the Titanic dataset to an Excel file:

In [9]:
import pandas as pd
import xlsxwriter

# Load the Titanic dataset into a pandas DataFrame
df = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# Create a new Excel file
workbook = xlsxwriter.Workbook('the_titanic.xlsx', {'nan_inf_to_errors': True})

# Create a new worksheet in the Excel file
worksheet = workbook.add_worksheet()

# Write the column headers to the worksheet
for i, column in enumerate(df.columns):
    worksheet.write(0, i, column)

# Write the data to the worksheet
for i, row in df.iterrows():
    for j, value in enumerate(row):
        worksheet.write(i + 1, j, value)

# Save the Excel file
workbook.close()

print("Successfully Exported")

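This cell requires the xlsxwriter package. When the package is not installed, the import fails with the following error:

ModuleNotFoundError: No module named 'xlsxwriter'

Installing the package first, for example with 'pip install xlsxwriter', resolves the error.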

### NOTE
Excel files exported from Jupyter Notebook may not be directly accessible from this platform. To access the file, you will need to run the export code on your local system and save the Excel file to a location where you can easily access it. Once saved, you can then open the file in your preferred Excel viewer or editor.

In the code above, a new Excel file is created using the 'Workbook' class from the xlsxwriter library, and a new worksheet is added to the Excel file using the 'add_worksheet()' method. The 'nan_inf_to_errors' option is set to True so that NaN and Inf values, which Excel cannot store as numbers, are written as Excel error values instead of causing the write to fail.

The column headers of the DataFrame are written to the worksheet's first row using a for loop that iterates over the columns.

Similarly, the data from the DataFrame is written to the worksheet using another for loop that iterates over each row and column in the DataFrame. The value from each cell is written to the corresponding cell in the Excel worksheet. Finally, the 'close()' method is called on the workbook to save and close the Excel file.
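For comparison, pandas can write a DataFrame to Excel in a single call with the 'to_excel()' method, which delegates the work to a writer library such as openpyxl or xlsxwriter. The sketch below is a hedged illustration: the two-row DataFrame and the output file name are made up, the file is written to a temporary folder, and the code checks which writer library is installed before exporting so that it does not fail on a missing module.

```python
import importlib.util
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample with one missing cabin
sample = pd.DataFrame({"PassengerId": [1, 2], "Cabin": [None, "C85"]})

# Pick whichever writer library happens to be installed, if any
engine = next(
    (name for name in ("openpyxl", "xlsxwriter") if importlib.util.find_spec(name)),
    None,
)

output_path = os.path.join(tempfile.mkdtemp(), "sample_titanic.xlsx")

if engine is None:
    print("No Excel writer found; install openpyxl or xlsxwriter first.")
else:
    # index=False leaves the DataFrame index out of the worksheet
    sample.to_excel(output_path, index=False, engine=engine)
    print("Wrote", output_path, "using", engine)
```

Writing cell by cell with xlsxwriter gives more control over formatting, while 'to_excel()' is shorter when a plain table is all that is needed.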

### CAVEAT
Exporting data to Excel can introduce potential issues such as loss of data formatting, incorrect data type conversion, and formatting inconsistencies. To avoid such issues, proper care should be taken to ensure that the exported data is correctly formatted and validated before exporting to Excel.

## Export to LaTeX
Exporting data to LaTeX allows you to create professional-looking tables that can be easily included in LaTeX documents. In the context of the Titanic dataset, exporting data to LaTeX can be useful for presenting summary statistics or results of statistical analyses.

Here's an example of how to export data from the Titanic dataset to a LaTeX table:

In [None]:
import pandas as pd

# Load the Titanic dataset
titanic = pd.read_csv('https://staticasssets.blob.core.windows.net/open-ai-coderunner/scripts/titanic.csv')

# Calculate summary statistics for the 'Age' column
age_summary = titanic['Age'].describe()

# Export summary statistics to a LaTeX table
age_table = age_summary.to_frame().to_latex()

# Print the LaTeX table to the console
print(age_table)

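This code calculates summary statistics for the 'Age' column in the Titanic dataset using the 'describe()' method. Next, the 'to_frame()' method is used to convert the summary statistics into a Pandas DataFrame, which is then passed to the 'to_latex()' method to generate the LaTeX code for the table. Finally, the generated LaTeX code is printed so that it can be copied into a LaTeX document. Note that recent pandas versions (2.0 and later, as far as we are aware) need the Jinja2 package to be installed for 'to_latex()' to work.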

### CAUTION
When exporting data to LaTeX format, it is important to note that the exported LaTeX code may not compile in every LaTeX setup without changes. For example, the table generated by 'to_latex()' uses rules such as '\toprule', which require the booktabs package to be loaded in the document. This is especially relevant if you are using specialized packages or formatting options. To ensure that your exported data produces the desired results, be prepared to test it with the compiler and packages you intend to use.