---   

<h1 align="center">Introduction to Data Analyst and Data Science for beginners</h1>
<h1 align="center">Lecture no 2.13(Pandas-04)</h1>

---
<h3><div align="right">Ehtisham Sadiq</div></h3>    

## _I/O with CSV, EXCEL, and JSON Files_

<img align="center" width="600" height="150"  src="images/fileformats.png" >

#### Read Pandas Documentation:
- General Info: https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html


- For `read_csv`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=read_csv#pandas.read_csv


- For `read_excel`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html?highlight=read_excel#pandas.read_excel


- For `read_json`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.io.json.read_json.html?highlight=pandas%20read_json#pandas.io.json.read_json


- For `to_csv`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html#pandas.DataFrame.to_csv



- For `to_excel`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html?highlight=to_excel#pandas.DataFrame.to_excel


- For `to_json`: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_json.html?highlight=to_json

## Learning agenda of this notebook
[Pandas](https://pandas.pydata.org/) provides helper functions to read data from various file formats like CSV, EXCEL, JSON, HTML, SQL table, and many more.
1. Reading data from a CSV/TSV File
2. Reading a CSV file from a Remote System
3. Writing Contents of Dataframe to a CSV File
4. Reading data from an EXCEL File
5. Writing Contents of Dataframe to an EXCEL File
6. Reading data from a JSON File
7. Writing Contents of Dataframe to a JSON File
8. Reading and Writing with a SQL Database

In [None]:
# To install this library in Jupyter notebook
import sys
!{sys.executable} -m pip install pandas --quiet

In [None]:
import pandas as pd
pd.__version__ , pd.__path__

## 1. Reading from CSV/TSV Files
>**CSV**: A text file in which the values are separated by commas is called a CSV (comma-separated values) file; when the values are separated by tab characters, it is called a TSV file. Each line of the file is a data record, and each record consists of one or more fields separated by a specific character called the separator. A CSV/TSV file is typically used to store tabular data (numbers and text), in which each line has the same number of fields.

### a. Reading a  Simple CSV File
The `pd.read_csv()` function is used to read a comma-separated file into a DataFrame.
```
pd.read_csv(filepath_or_buffer, delimiter=None, header='infer', skiprows=None, nrows=None, usecols=None, skipfooter=0, ...)
```

In [None]:
! cat datasets/classmarks.csv

In [None]:
# By default, `read_csv` assumes that the file contains comma-separated values
# and that the first row of the file contains the column names, which are used as column labels
df = pd.read_csv('datasets/classmarks.csv')
df

**The `df.head(N)` method is used to select/display first `N` rows, based on `position`, i.e., the integer value corresponding to the position of the row (from 0 to n-1). The default value of `N` is 5.**

In [None]:
df.head()

In [None]:
df.head(3)

In [None]:
# For negative values of n, this method returns all rows except the last `n` rows, equivalent to df[:-n].
# The df has a total of 50 rows, so the following will return first 2 rows
df.head(-48)

**The `df.tail(N)` method is used to select/display last `N` rows, based on `position`, i.e., the integer value corresponding to the position of the row (from 0 to n-1). The default value of `N` is 5.**

In [None]:
# tail() method is useful for quickly verifying data, after sorting or appending rows.
df.tail()

In [None]:
df.tail(3)

In [None]:
# For negative values of `n`, this method returns all rows except the first `|n|` rows, equivalent to df[|n|:]
# The df has a total of 50 rows, so the following will return the last 4 rows (50 - 46 = 4)
df.tail(-46)
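
The slicing equivalence mentioned in the comments above can be checked on a small stand-in DataFrame. This is only a minimal sketch: `df_demo` and its 10 rows are made up for illustration and are not the lecture's dataset.

```python
import pandas as pd

# Hypothetical 10-row DataFrame (the lecture's df has 50 rows)
df_demo = pd.DataFrame({"rollno": range(1, 11), "math": range(50, 60)})

n = 7
first_rows = df_demo.head(-n)  # everything except the last 7 rows
last_rows = df_demo.tail(-n)   # everything except the first 7 rows

print(len(first_rows), len(last_rows))   # 10 - 7 rows remain in each case
print(first_rows.equals(df_demo[:-n]))   # same rows as the slice df[:-n]
print(last_rows.equals(df_demo[n:]))     # same rows as the slice df[n:]
```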

**The `df.sample()` method returns a specified number of random rows. It returns 1 row if a number is not specified. The column names are also returned along with the sampled rows.**

In [None]:
df.sample(5)

### b. Reading a CSV File having a Delimiter other than Comma
- By default, `read_csv()` expects a comma as the separator. If the CSV file uses some other separator or delimiter (such as a semicolon or a tab), the fields will not be split correctly: each line is typically read into a single column, and in some cases an error is raised.
- To handle this, pass the appropriate value to the `delimiter` argument of the `read_csv()` method.

In [None]:
! cat datasets/classmarkswithtab.csv

In [None]:
df = pd.read_csv('datasets/classmarkswithtab.csv')
df.head()

In [None]:
df = pd.read_csv('datasets/classmarkswithtab.csv', delimiter='\t')
df.head()

### c. Reading a CSV File not having Column Labels
- By default, the `read_csv()` method assumes that the first row of the file contains the column labels.
- If this is not the case, i.e., the file contains only data and no column labels, the first data row will be treated as the column labels.
- The following example illustrates this.

In [None]:
! cat datasets/classmarkswithoutcollabels.csv

In [None]:
df = pd.read_csv('datasets/classmarkswithoutcollabels.csv')
df.head()

**To read such files, you have to pass the parameter `header=None` to the `read_csv()` method as shown below**

In [None]:
df = pd.read_csv('datasets/classmarkswithoutcollabels.csv', header=None)
df.head()


**Now if you want to assign new column labels to make them more understandable, you can assign the list of column labels to the `columns` attribute of the dataframe object**

In [None]:
col_names = ['rollno', 'gender', 'group', 'age', 'math', 'english', 'urdu']
df.columns = col_names
df.head()
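
As an alternative to assigning `df.columns` afterwards, `read_csv()` also accepts a `names` argument, so the labels can be supplied while reading. The sketch below is only illustrative: `csv_text` is made-up in-memory data, not the lecture's file.

```python
import io
import pandas as pd

# Hypothetical header-less CSV content
csv_text = "1,female,72\n2,male,69\n3,female,90\n"

# Without header=None, the first data row is consumed as column labels
df_lost = pd.read_csv(io.StringIO(csv_text))
print(len(df_lost))  # one row fewer than the file contains

# header=None keeps every row as data; names= supplies the labels
df_named = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["rollno", "gender", "math"],
)
print(len(df_named))
print(list(df_named.columns))
```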

### d. Reading a CSV File having Comments in the beginning
- You may get an error, or a garbled DataFrame, while reading a CSV file because someone has added a few comment lines at the top of the file. Pandas can still read the dataset if you skip those rows.
- To deal with the resulting `ParserError`, open the CSV file in a text editor and check whether it has comments at the top.
- If yes, count the number of rows to skip.
- While reading the file, pass the parameter **skiprows = n** (number of comment rows at the beginning to skip)
- While reading the file, pass the parameter **skipfooter = n** (number of comment rows at the end to skip); this option is not supported by the default C parser, so also pass `engine='python'`

In [None]:
! cat datasets/classmarkswithtopcomments.csv

In [None]:
# Try reading a csv file having 3 comments lines in the beginning.
df = pd.read_csv('datasets/classmarkswithtopcomments.csv')
df.head()

In [None]:
# Skip the 3 comment lines at the beginning by passing `skiprows=3`.
df = pd.read_csv('datasets/classmarkswithtopcomments.csv', skiprows=3)
df.head()
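
The cells above only exercise `skiprows`. The following sketch combines it with `skipfooter` on a hypothetical in-memory file (`csv_text`, invented for illustration) that has two comment lines at the top and one at the bottom. `engine='python'` is passed explicitly because `skipfooter` is not supported by the default C parser.

```python
import io
import pandas as pd

# Hypothetical CSV content with comment lines before and after the data
csv_text = (
    "# Class marks export\n"
    "# Prepared for the demo\n"
    "rollno,math\n"
    "1,72\n"
    "2,69\n"
    "# end of file\n"
)

df_clean = pd.read_csv(
    io.StringIO(csv_text),
    skiprows=2,       # skip the two comment lines at the top
    skipfooter=1,     # skip the one comment line at the bottom
    engine="python",  # skipfooter requires the Python parser
)
print(df_clean)
```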

### e. Reading a portion of CSV File in a Dataframe
- Suppose the dataset inside the CSV file is very large and you do not want to spend the time needed to read all of it
- Or your system might crash when you try to load that much data
- The solution is to read only
    - A specific number of rows, by passing the `nrows` parameter to the `read_csv()` method
    - Specific columns, by passing the `usecols` parameter to the `read_csv()` method

In [None]:
# Read just 10 rows from the csv file by passing the number of rows to read to `nrows` argument
df = pd.read_csv('datasets/classmarks.csv', nrows=10)
df.shape
#df.head()

In [None]:
df

In [None]:
# Read specific columns from the csv file by passing a list of column names to `usecols` argument
df = pd.read_csv('datasets/classmarks.csv', usecols= ['rollno', 'group','english'])
df.shape
#df.head()

In [None]:
df.head()

In [None]:
# Of course, you can use both parameters at the same time
df = pd.read_csv('datasets/classmarks.csv', nrows= 7, usecols= ['rollno', 'group','english'])
df.shape

In [None]:
df
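
A self-contained sketch of the same idea, assuming a small made-up dataset (`csv_text`) rather than `classmarks.csv`. It shows that `nrows` limits the rows and `usecols` limits the columns in a single call.

```python
import io
import pandas as pd

# Hypothetical CSV content with 5 data rows and 4 columns
csv_text = (
    "rollno,gender,group,english\n"
    "1,female,group A,70\n"
    "2,male,group B,65\n"
    "3,female,group B,88\n"
    "4,male,group C,59\n"
    "5,female,group A,91\n"
)

# Read only the first 3 data rows and only 2 of the 4 columns
df_part = pd.read_csv(
    io.StringIO(csv_text),
    nrows=3,
    usecols=["rollno", "english"],
)
print(df_part.shape)
print(df_part)
```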

## 2. Reading a CSV File from a Remote System

In [None]:
# To avoid URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED]..... 
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

### a. Reading a CSV file from GitHub Gist

The `people.csv` file resides in my GitHub repository at the following URL:
https://raw.githubusercontent.com/bsef19m521/Introduction-to-Data-Analyst-and-Data-Science/master/Module%20no%2002-%20Python%20for%20Data%20Scientists/02%20-%20Pandas%20for%20Data%20Scientists/datasets/people.csv

[Bitly.ws](http://bitly.ws/) is a URL shortening service that I have used to create a short link for easy usage in the cell below:

In [None]:
import pandas as pd
url = "http://bitly.ws/tgCj"
df = pd.read_csv(url)
df

### b. Reading a CSV file from Google Sheets


- Google Sheet URL: https://docs.google.com/spreadsheets/d/1l-bh4Mga8JW3yvO0Cfr3P3E6_EpdIXH4yyvS6FOoZVI/edit#gid=212220603

In [None]:
sheetID = '1l-bh4Mga8JW3yvO0Cfr3P3E6_EpdIXH4yyvS6FOoZVI'
sheetName = 'sheet1'
URL = 'https://docs.google.com/spreadsheets/d/{0}/gviz/tq?tqx=out:csv&sheet={1}'.format(sheetID, sheetName)

df = pd.read_csv(URL)
df.head()

In [None]:
URL

## 3. Writing Contents of Dataframe to a CSV File
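- The `df.to_csv()` method (`DataFrame.to_csv()`) is used to write the contents of a dataframe (with row indices, by default) to a CSV file.
- The only argument you normally need is the file path; if it is omitted, the CSV text is returned as a string.
- For details, see the help page or the pandas documentation (link given above)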

In [None]:
df_class = pd.read_csv('datasets/classmarks.csv')
df_class.head(7)

>- Let us create a new dataframe from above dataframe containing records of only group B

In [None]:
mask = (df_class['group'] == 'group B')
mask.head(7)

In [None]:
df_class_groupB = df_class.loc[mask]
df_class_groupB

In [None]:
df_class_groupB.to_csv('datasets/classmarksgroupB.csv')

In [None]:
df = pd.read_csv('datasets/classmarksgroupB.csv')
df

>To avoid writing the row indices column to the file, pass the `index=False` argument to the `to_csv()` method

In [None]:
df_class_groupB.to_csv('datasets/classmarksgroupB.csv', index=False)

In [None]:
df = pd.read_csv('datasets/classmarksgroupB.csv')
df
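
The difference between the two `to_csv()` calls can be sketched without touching the disk, because `to_csv()` returns the CSV text when no path is given. `df_demo` below is a made-up two-row frame, not the lecture's group B data.

```python
import io
import pandas as pd

# Hypothetical filtered frame whose index labels are not 0, 1, ...
df_demo = pd.DataFrame(
    {"rollno": [11, 12], "group": ["group B", "group B"]},
    index=[4, 9],
)

with_index = df_demo.to_csv()                # index written as an unnamed first column
without_index = df_demo.to_csv(index=False)  # index left out

# Reading the first variant back yields an extra 'Unnamed: 0' column
df_back = pd.read_csv(io.StringIO(with_index))
print(list(df_back.columns))

# Alternatively, index_col=0 turns that first column back into the index
df_restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(df_restored.index.tolist())
```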

## 4. I/O with EXCEL Files
>**XLSX**: XLSX is the Microsoft Excel Open XML spreadsheet file format. It is an XML-based file format created by Microsoft. In an XLSX file, data is organized in cells arranged in rows and columns within a sheet, and a workbook may contain one or more sheets.

In [None]:
import sys
!{sys.executable} -m pip install xlrd xlwt openpyxl

### a. Reading a Simple Excel File

In [None]:
df = pd.read_excel(io='datasets/classmarks.xlsx')
df.head()

### b. Reading an Excel File having Comments in the beginning
- An Excel file may have a few comment rows at the top, in which case `read_excel()` treats the first comment row as the column labels. Pandas can still read the dataset if you skip those rows.
- Open the Excel file in MS Excel and check whether it has comments at the top.
- If yes, count the number of rows to skip.
- While reading the file, pass the parameter **skiprows = n** (number of comment rows at the beginning to skip)
- While reading the file, pass the parameter **skipfooter = n** (number of comment rows at the end to skip)

In [None]:
# The following file has three lines of comments in the beginning of the file.
df = pd.read_excel(io='datasets/classmarkswithcomments.xlsx')

df.head()

In [None]:
# Skip the three comment lines at the beginning of the file by passing `skiprows=3`.
df = pd.read_excel(io='datasets/classmarkswithcomments.xlsx',skiprows=3)

df.head()

### c. Reading Excel Workbook with Multiple Sheets
- By default, the `pd.read_excel()` function reads only the first sheet.
- To read a particular sheet of an Excel file having multiple sheets, pass its name to the `sheet_name` argument.
- The `big_mart_sales_with_multiple_sheets.xlsx` is a workbook that contains three sheets with data for different years. The sheet names are 1985, 1987, and 1997.

In [None]:
df = pd.read_excel('datasets/big_mart_sales_with_multiple_sheets.xlsx')
# if you check/view the data you can see, it only contains the data of first excel sheet (for the year 1985)
df.shape

In [None]:
df_1985 = pd.read_excel('datasets/big_mart_sales_with_multiple_sheets.xlsx',sheet_name='1985')
df_1987 = pd.read_excel('datasets/big_mart_sales_with_multiple_sheets.xlsx',sheet_name='1987')
df_1997 = pd.read_excel('datasets/big_mart_sales_with_multiple_sheets.xlsx',sheet_name='1997')

In [None]:
print("Sheet1 : ")
df_1985.shape

In [None]:
print("Sheet2 : ")
df_1987.shape

In [None]:
print("Sheet3 : ")
df_1997.shape

## 5. Writing Contents of Dataframe to an EXCEL File
- The `df.to_excel()` method (`DataFrame.to_excel()`) is used to write the contents of a dataframe (with row indices, by default) to an Excel file.
- The only required argument is the file path.
- For details, see the help page or the pandas documentation (link given above)

>- Let us create a single new dataframe by concatenating the three dataframes above using the `pd.concat()` function

In [None]:
df_concatenated = pd.concat(objs=[df_1985, df_1987, df_1997])

df_concatenated.shape

**Note that the total number of rows in this dataframe equals `1463+932+930 = 3325`**

In [None]:
df_concatenated.head()
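
One detail worth checking after `pd.concat()` is the row index: by default each input keeps its own index labels, so labels can repeat. The sketch below uses two tiny made-up frames (`df_a`, `df_b`) instead of the Big Mart sheets, and shows the `ignore_index=True` option as one possible way to get a fresh 0..n-1 index.

```python
import pandas as pd

# Hypothetical per-year frames standing in for the Excel sheets
df_a = pd.DataFrame({"year": [1985, 1985], "sales": [10.0, 12.5]})
df_b = pd.DataFrame({"year": [1987], "sales": [7.0]})

# Default behaviour: rows are stacked and original index labels are kept
stacked = pd.concat(objs=[df_a, df_b])
print(stacked.index.tolist())

# ignore_index=True discards the old labels and renumbers the rows
reindexed = pd.concat(objs=[df_a, df_b], ignore_index=True)
print(reindexed.index.tolist())

# Either way, the row count is the sum of the inputs' row counts
print(len(stacked) == len(df_a) + len(df_b))
```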

In [None]:
# You can store the concatenated data from your dataframe in a single Excel file
# Pass the argument `index=False` to avoid storing the row indices (0, 1, 2, 3, ...) in the Excel file.

df_concatenated.to_excel(excel_writer='temp.xlsx', index=False)

In [None]:
# Let us verify
data = pd.read_excel(io='temp.xlsx')

data.shape

## 6. I/O with JSON Files

>**JSON**: JavaScript Object Notation is a text-based open standard file format that uses human-readable text consisting of attribute–value pairs and arrays. It is a data interchange format that is used to store and transfer the data via Internet, primarily between a web client and a server.

### a. Reading a Simple JSON File
#### To view the content of any JSON file, visit this [website](https://jsoneditoronline.org/#left=local.yoyezu)

In [None]:
import sys
!{sys.executable} -m pip install SQLAlchemy psycopg2-binary 

In [None]:
! cat datasets/simple.json

In [None]:
# read the json file using read_json method of pandas library
df = pd.read_json('datasets/simple.json')
df

### b. Reading JSON File having each record in a separate line
- Some JSON files are written as records, i.e., each line of the file is a separate JSON object (often called the JSON Lines format). For example, a file with two records on two lines:
```
{"name": "Ahsan", "roll_no": "100"}
{"name": "Ayesha", "roll_no": "101"}
```

In [None]:
! cat datasets/simple_records.json

In [None]:
# To read such a file, you need to pass `lines=True` to the `pd.read_json()` function
df = pd.read_json('datasets/simple_records.json',lines=True)

df
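
A minimal sketch of reading line-delimited JSON from memory, assuming a hypothetical two-record string (`jsonl_text`) in place of `simple_records.json`. The text is wrapped in `io.StringIO` so that pandas treats it as a file-like object.

```python
import io
import pandas as pd

# Hypothetical JSON Lines content: one JSON object per line
jsonl_text = (
    '{"name": "Ahsan", "roll_no": 100}\n'
    '{"name": "Ayesha", "roll_no": 101}\n'
)

# lines=True tells read_json to parse each line as a separate record
df_records = pd.read_json(io.StringIO(jsonl_text), lines=True)
print(df_records.shape)
print(df_records)
```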

## 7. Writing Contents of Dataframe to a JSON File

In [None]:
df.to_json('datasets/temp.json')

In [None]:
df = pd.read_json('datasets/temp.json')
df
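
The layout of the JSON that `to_json()` writes depends on its `orient` argument. The sketch below, which uses a made-up frame `df_demo`, compares the default column-oriented layout with the record-per-line layout from the previous section. `to_json()` returns a string when no path is given, which keeps the example self-contained.

```python
import io
import json
import pandas as pd

# Hypothetical frame used only for this illustration
df_demo = pd.DataFrame({"name": ["Ahsan", "Ayesha"], "roll_no": [100, 101]})

# Default layout: {column -> {index label -> value}}
json_columns = df_demo.to_json()
print(json.loads(json_columns))

# One JSON object per line (requires orient='records')
json_records = df_demo.to_json(orient="records", lines=True)
print(json_records)

# The line-delimited text can be read back with lines=True
df_back = pd.read_json(io.StringIO(json_records), lines=True)
print(df_back)
```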

## 8. Reading and Writing with a SQL Database

In [None]:
# First of all, we install the MySQL connector (PyPI package `mysql-connector-python`) on our machine
!pip install mysql-connector-python

In [None]:
# Second, we import mysql.connector to create connection object
import warnings
warnings.filterwarnings('ignore')
import mysql.connector

In [None]:
# Third we create our connection object
conn = mysql.connector.connect(host='localhost',user='root',password='Pucit12345#',database='ranking')

In [None]:
# We will execute our query
df = pd.read_sql_query("SELECT * FROM DataTable",conn)

In [None]:
df.head()
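
The cells above need a running MySQL server. For a self-contained sketch of both directions (writing and reading), the example below swaps in an in-memory SQLite database from Python's standard library; this is an assumption made for illustration, not the lecture's MySQL setup. The frame `df_scores` and its contents are made up, while the table name `DataTable` follows the query above.

```python
import sqlite3
import pandas as pd

# In-memory SQLite database (a stand-in for the MySQL connection)
conn = sqlite3.connect(":memory:")

# Hypothetical data to store
df_scores = pd.DataFrame({"name": ["Ahsan", "Ayesha"], "score": [81, 93]})

# Write the dataframe to a SQL table; index=False skips the row indices
df_scores.to_sql("DataTable", conn, index=False)

# Read rows back with an ordinary SQL query
df_sql = pd.read_sql_query("SELECT * FROM DataTable WHERE score > 90", conn)
print(df_sql)

conn.close()
```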

## Check Your Concepts:
- What is Pandas?

# Pandas - Assignment no 04
- Here is link of [Pandas - Assignment no 04]()