---
# Reading and Writing Data to Different Sources

Data can be stored in many different formats. This notebook covers loading data into pandas from several file types and writing DataFrames back out to them.

---

In [2]:
import pandas as pd
import numpy as np
from IPython.display import display

In [3]:
# Helper for printing a horizontal rule. For display purposes only
def printhr(s: str = None, n: int = 40):
    """Print a horizontal rule of "=" characters, optionally around a header.

    Args:
        s (str, optional): Header message. Defaults to None.
        n (int, optional): Number of "=" characters. Defaults to 40.
    """

    if s:
        # Split the rule in half and place the header in the middle
        half = "=" * (n // 2)
        print(half, s, half)
    else:
        print("=" * n)

---
## Comma-Separated Values - .csv

A CSV file is a plain text file in which each row is a line and the values in each column are separated by a delimiter, usually a comma.
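
To make the plain-text nature concrete, here is a minimal sketch that parses a small CSV string held in memory and writes it back out. The column names and values are made up for illustration and are not taken from the survey file.

```python
import io

import pandas as pd

# Hypothetical CSV text: a header row, then one record per line.
csv_text = "ResponseId,Country,YearsCode\n1,Japan,5\n2,Germany,12\n"

# read_csv accepts any file-like object, so no file on disk is needed here.
sample_df = pd.read_csv(io.StringIO(csv_text), index_col="ResponseId")

# Calling to_csv without a path returns the CSV text as a string.
round_trip = sample_df.to_csv()
print(round_trip)
```

Because the index was named when loading, `to_csv` writes it back as the first column, so the text survives the round trip.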

---

In [10]:
# Load the CSV into a DataFrame and use ResponseId as the index
df = pd.read_csv("data/survey_results_public_2022.csv", index_col="ResponseId")

df.head(3)

ResponseId,MainBranch,Employment,RemoteWork,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,LearnCodeCoursesCert,YearsCode,YearsCodePro,...,TimeSearching,TimeAnswering,Onboarding,ProfessionalTech,TrueFalse_1,TrueFalse_2,TrueFalse_3,SurveyLength,SurveyEase,ConvertedCompYearly
1,None of these,,,,,,,,,,...,,,,,,,,,,
2,I am a developer by profession,"Employed, full-time",Fully remote,Hobby;Contribute to open-source projects,,,,,,,...,,,,,,,,Too long,Difficult,
3,"I am not primarily a developer, but I write co...","Employed, full-time","Hybrid (some remote, some in-person)",Hobby,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Friend or family member...,Technical documentation;Blogs;Programming Game...,,14.0,5.0,...,,,,,,,,Appropriate in length,Neither easy nor difficult,40205.0


In [5]:
# Write to csv

# Create new df
filt = df["Country"] == "Japan"
japan_df = df.loc[filt]

# Write to csv file. This creates csv_file.csv inside the new_files folder,
# which must already exist (to_csv does not create missing folders)
japan_df.to_csv("new_files/csv_file.csv")

---
### Delimiters

Since CSVs are just plain text files, they can be delimited with different characters. The separator (delimiter) can be any single character; common choices are the comma, tab, and colon.

Pass the **sep** parameter when the delimiter is not a comma, both when reading (`pd.read_csv`) and when writing (`.to_csv`). It defaults to a comma (`,`).
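
As a rough illustration of why the delimiter has to match on both sides, the sketch below writes a tiny DataFrame with a tab delimiter and reads it back twice: once with `sep="\t"` and once with the default comma. The data and the variable names are invented for this example.

```python
import io

import pandas as pd

# Hypothetical data, indexed by a made-up ResponseId.
people_df = pd.DataFrame(
    {"Country": ["Germany", "Japan"], "YearsCode": [15, 5]},
    index=pd.Index([6, 7], name="ResponseId"),
)

# Write with a tab delimiter (returns a string because no path is given).
tsv_text = people_df.to_csv(sep="\t")

# Reading with the same delimiter recovers the original columns.
tab_df = pd.read_csv(io.StringIO(tsv_text), sep="\t", index_col="ResponseId")

# Reading with the default comma finds no commas, so each line
# ends up as a single column.
comma_df = pd.read_csv(io.StringIO(tsv_text))

print(tab_df.shape, comma_df.shape)
```

No error is raised in the mismatched case, so a quick look at the shape or the column names after loading is a cheap way to catch a wrong `sep`.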

---

In [9]:
# Write to a tab-separated values (TSV) file.
# TSV is a variation of CSV that uses a tab as its delimiter.

# Create new df
filt = df["Country"] == "Germany"
germany_df = df.loc[filt]
display(germany_df.head(3))

# Write
germany_df.to_csv("new_files/tsv_file.tsv", sep="\t")

ResponseId,MainBranch,Employment,RemoteWork,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,LearnCodeCoursesCert,YearsCode,YearsCodePro,...,TimeSearching,TimeAnswering,Onboarding,ProfessionalTech,TrueFalse_1,TrueFalse_2,TrueFalse_3,SurveyLength,SurveyEase,ConvertedCompYearly
6,"I am not primarily a developer, but I write co...","Student, full-time",,,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)","Books / Physical media;School (i.e., Universit...",,,15,,...,,,,,,,,Appropriate in length,Easy,
26,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Other online resources ...,Technical documentation;Blogs;Written Tutorial...,Coursera;Udemy;Codecademy;edX;Udacity,16,9.0,...,30-60 minutes a day,60-120 minutes a day,Somewhat short,DevOps function;Microservices;Continuous integ...,Yes,No,Yes,Appropriate in length,Neither easy nor difficult,90647.0
49,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby;Contribute to open-source projects,Some college/university study without earning ...,Books / Physical media;Other online resources ...,Technical documentation;Blogs;Written Tutorial...,,40,25.0,...,15-30 minutes a day,Less than 15 minutes a day,Just right,Continuous integration (CI) and (more often) c...,Yes,No,Yes,Appropriate in length,Easy,106644.0


---
Loading with a different separator

---

In [11]:
# Load the tab-separated values (TSV) file into a DataFrame
df = pd.read_csv("new_files/tsv_file.tsv", sep="\t", index_col="ResponseId")
display(df.head(3))


ResponseId,MainBranch,Employment,RemoteWork,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,LearnCodeCoursesCert,YearsCode,YearsCodePro,...,TimeSearching,TimeAnswering,Onboarding,ProfessionalTech,TrueFalse_1,TrueFalse_2,TrueFalse_3,SurveyLength,SurveyEase,ConvertedCompYearly
6,"I am not primarily a developer, but I write co...","Student, full-time",,,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)","Books / Physical media;School (i.e., Universit...",,,15,,...,,,,,,,,Appropriate in length,Easy,
26,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Other online resources ...,Technical documentation;Blogs;Written Tutorial...,Coursera;Udemy;Codecademy;edX;Udacity,16,9.0,...,30-60 minutes a day,60-120 minutes a day,Somewhat short,DevOps function;Microservices;Continuous integ...,Yes,No,Yes,Appropriate in length,Neither easy nor difficult,90647.0
49,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby;Contribute to open-source projects,Some college/university study without earning ...,Books / Physical media;Other online resources ...,Technical documentation;Blogs;Written Tutorial...,,40,25.0,...,15-30 minutes a day,Less than 15 minutes a day,Just right,Continuous integration (CI) and (more often) c...,Yes,No,Yes,Appropriate in length,Easy,106644.0


---
## Excel - .xlsx and .xls

Excel files are Microsoft's proprietary spreadsheet files. XLSX is the newer Excel file format and can be opened only by Excel 2007 and later. XLS is the older file format and can be opened by all versions.

Unlike CSV files, reading and writing Excel files requires additional packages:  
`openpyxl` - for reading and writing xlsx  
`xlrd` - for reading old xls  

pip can install several packages in one command, if you want both:  
`pip install openpyxl xlrd`
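
If you are unsure whether these packages are already installed, a small check like the one below can help. This is only a sketch: `EXCEL_PACKAGES` and `missing_excel_packages` are hypothetical names introduced here, and the check looks for importable modules without verifying their versions.

```python
import importlib.util

# Package used for each Excel extension, as listed above.
EXCEL_PACKAGES = {".xlsx": "openpyxl", ".xls": "xlrd"}


def missing_excel_packages() -> list[str]:
    """Return the names of the Excel helper packages that cannot be imported."""
    return [
        package
        for package in EXCEL_PACKAGES.values()
        if importlib.util.find_spec(package) is None
    ]


missing = missing_excel_packages()
if missing:
    print("Install with: pip install " + " ".join(missing))
else:
    print("Both Excel packages are available.")
```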

---

---
### Reading and Writing .xlsx

---

In [27]:
# Reading
excel_df = pd.read_excel("data/excel_new.xlsx", index_col=0)
display(excel_df.head())

0,First Name,Last Name,Gender,Country,Age,Date,Id
1,Dulce,Abril,Female,United States,32,15/10/2017,1562
2,Mara,Hashimoto,Female,Great Britain,25,16/08/2016,1582
3,Philip,Gent,Male,France,36,21/05/2015,2587
4,Kathleen,Hanner,Female,United States,25,15/10/2017,3549
5,Nereida,Magwood,Female,United States,58,16/08/2016,2468


In [26]:
# Writing
display(df.head(3))
df.to_excel("new_files/new_excel.xlsx")

ResponseId,MainBranch,Employment,RemoteWork,CodingActivities,EdLevel,LearnCode,LearnCodeOnline,LearnCodeCoursesCert,YearsCode,YearsCodePro,...,TimeSearching,TimeAnswering,Onboarding,ProfessionalTech,TrueFalse_1,TrueFalse_2,TrueFalse_3,SurveyLength,SurveyEase,ConvertedCompYearly
6,"I am not primarily a developer, but I write co...","Student, full-time",,,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)","Books / Physical media;School (i.e., Universit...",,,15,,...,,,,,,,,Appropriate in length,Easy,
26,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby,"Master’s degree (M.A., M.S., M.Eng., MBA, etc.)",Books / Physical media;Other online resources ...,Technical documentation;Blogs;Written Tutorial...,Coursera;Udemy;Codecademy;edX;Udacity,16,9.0,...,30-60 minutes a day,60-120 minutes a day,Somewhat short,DevOps function;Microservices;Continuous integ...,Yes,No,Yes,Appropriate in length,Neither easy nor difficult,90647.0
49,I am a developer by profession,"Employed, full-time","Hybrid (some remote, some in-person)",Hobby;Contribute to open-source projects,Some college/university study without earning ...,Books / Physical media;Other online resources ...,Technical documentation;Blogs;Written Tutorial...,,40,25.0,...,15-30 minutes a day,Less than 15 minutes a day,Just right,Continuous integration (CI) and (more often) c...,Yes,No,Yes,Appropriate in length,Easy,106644.0


---
### Reading and Writing .xls

---