# Agenda 

__Part-I__

  - Structured, Semi-Structured and Unstructured Data

  - File Formats
    - CSV
    - JSON
    - Working with Zip files


__Part-II__

  - Use of pandas for reading csv, tsv files
  - Use of json library for json files
  - Use of zipfile library for zip files
  - Working with text files

__Part-III__

LAB

# Part-I

## Types of Data (Classifying data by its *nature*)


__What is data?__

__Structured Data__

- Conforms to a data model
- The model defines which fields of data will be stored
- How that data will be stored: data type (numeric, alphabetic, date, Boolean)
- Any restrictions on the data input (number of characters, etc.)

__Examples__

- Databases


<img src = "https://database.guide/wp-content/uploads/2016/06/MySQL_Schema_Music_Example.png" width = 550 />


[Image Source](https://database.guide/what-is-a-database-schema/)
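As a minimal sketch of what "conforming to a data model" means in practice, the snippet below uses Python's built-in `sqlite3` module. The `artists` table, its columns, and the 20-character limit are hypothetical and chosen only for illustration; they are not taken from the schema in the image.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table: the schema fixes the fields, their types,
# and a restriction on the input (name of at most 20 characters).
conn.execute(
    """
    CREATE TABLE artists (
        artist_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL CHECK (length(name) <= 20),
        active    INTEGER NOT NULL CHECK (active IN (0, 1))
    )
    """
)

# A row that conforms to the schema is accepted.
conn.execute("INSERT INTO artists (name, active) VALUES (?, ?)", ("Band A", 1))

# A row that violates a restriction is rejected by the database.
try:
    conn.execute("INSERT INTO artists (name, active) VALUES (?, ?)", ("x" * 50, 1))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

rows = conn.execute("SELECT artist_id, name, active FROM artists").fetchall()
print(rows, rejected)
conn.close()
```

The point of the sketch is that the restrictions live in the data model itself, so invalid input is stopped before it is stored.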

__Semi-Structured Data__

- The data has some structure in the form of tags, rows, columns, hierarchies, or fields.

- However, there are not necessarily uniform restrictions or pre-defined types for those fields and tags.



An example of semi-structured data: https://json.reddit.com/r/dataisbeautiful/
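To make the idea concrete, here is a small sketch with made-up records (they are not taken from the Reddit feed). Both records are valid JSON objects, yet they do not share the same fields, and the `score` field has a different type in each.

```python
import json

# Made-up records: both are valid JSON objects, but they do not share
# the same set of fields, and "score" has a different type in each.
raw = """
[
  {"title": "Post A", "score": 120, "author": {"name": "alice"}},
  {"title": "Post B", "score": "hidden", "tags": ["chart", "map"]}
]
"""

posts = json.loads(raw)

for post in posts:
    # .get() returns a default when a field is absent.
    author = post.get("author", {}).get("name", "unknown")
    print(post["title"], type(post["score"]).__name__, author)

# Collect the field names of each record to compare them.
field_sets = [sorted(post.keys()) for post in posts]
print(field_sets)
```

Code that consumes semi-structured data therefore usually has to check for missing fields and unexpected types itself.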

__Unstructured Data__

From Wikipedia: https://en.wikipedia.org/wiki/Unstructured_data

- No pre-defined data model

- Not organized in fields, tags, attributes, etc.

__Examples__

social media posts, images, texts.

## File Formats (How do we record data)

__What is a file?__



- Create a docx file and open it in Sublime Text
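One way to think about this exercise: a file is a sequence of bytes, and the file format decides how those bytes are interpreted. The sketch below reads the same small file once as text and once as raw bytes. The file name `note.txt` and its contents are hypothetical.

```python
import os
import tempfile

# Write a short text file into a temporary directory (hypothetical file name).
note_path = os.path.join(tempfile.mkdtemp(), "note.txt")
with open(note_path, "w", encoding="utf-8") as f:
    f.write("data: 42")

# Text mode decodes the stored bytes into a string.
with open(note_path, "r", encoding="utf-8") as f:
    as_text = f.read()

# Binary mode returns the raw bytes exactly as stored on disk.
with open(note_path, "rb") as f:
    as_bytes = f.read()

print(repr(as_text))
print(as_bytes, len(as_bytes))
```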

__Why do we need file formats__


- Share

- Store - Preserve

- Access

__Frequently used file formats in data science__


### CSV Files (Comma Separated Values)



- Uses `.csv` extension.

- As the name suggests, data is organized in columns, and the values in each row are separated by commas.

- Very common format in data science.

- Looks very similar to a spreadsheet file; note, however, that it cannot store formulas or formatting.

- Because there are no formulas or formatting, this format is easier to work with for large files.

- If we use `tab` instead of `,` then the file format is called `tsv` (tab separated values)

Data.csv https://github.com/jackiekazil/data-wrangling/blob/master/data/chp3/data-text.csv
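The difference between the two formats can be sketched with the standard `csv` module. The rows below are made up, and in-memory buffers stand in for real files. Note how the CSV writer quotes a value that itself contains a comma.

```python
import csv
import io

# Made-up rows; the second value contains a comma on purpose.
rows = [["Country", "Note"], ["Chad", "landlocked, central Africa"]]

# Write the same rows as CSV and as TSV into in-memory text buffers.
csv_buffer = io.StringIO()
csv.writer(csv_buffer).writerows(rows)

tsv_buffer = io.StringIO()
csv.writer(tsv_buffer, delimiter="\t").writerows(rows)

print(csv_buffer.getvalue())
print(tsv_buffer.getvalue())

# Reading back with the matching delimiter recovers the original rows.
csv_rows = list(csv.reader(io.StringIO(csv_buffer.getvalue(), newline="")))
tsv_rows = list(
    csv.reader(io.StringIO(tsv_buffer.getvalue(), newline=""), delimiter="\t")
)
```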

### JSON (JavaScript Object Notation)

- Uses `.json` extension.

- It is an alternative to `.xml` files and it is more human-readable.

- It requires less syntax.

- [Check nested objects](https://www.digitalocean.com/community/tutorials/an-introduction-to-json#:~:text=JSON%20%E2%80%94%20short%20for%20JavaScript%20Object,like%20the%20name%20%E2%80%9CJason.%E2%80%9D)
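As a small illustration of nested objects, the sketch below builds a made-up record (a dictionary holding a list of dictionaries), turns it into JSON text, and then reads a deeply nested value back. All names and values are invented for the example.

```python
import json

# A made-up nested object: a dictionary holding a list of dictionaries.
record = {
    "country": "Exampleland",
    "indicators": [
        {"name": "life_expectancy", "values": {"male": 79, "female": 83}},
    ],
}

# Serialize to a JSON string; indent only affects readability.
text = json.dumps(record, indent=2)
print(text)

# Parse it back and walk down the nesting level by level.
parsed = json.loads(text)
female_value = parsed["indicators"][0]["values"]["female"]
print(female_value)
```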

# Part-II


Let's download the file.

In [None]:
from urllib import request

def download_file(file_name, url):
    res = request.urlopen(url)
    with open(file_name,'wb') as file:
        file.write(res.read())
    
download_file('data-text.json', 'https://raw.githubusercontent.com/jackiekazil/data-wrangling/master/data/chp3/data-text.json')
download_file('data-text.csv' , 'https://raw.githubusercontent.com/jackiekazil/data-wrangling/master/data/chp3/data-text.csv')

In [None]:
!ls

In [None]:
!head -3 data-text.csv

### Read and Write Files in Python

## CSV files

In [None]:
import csv 

In [None]:
path = 'data-text.csv'

## open the csv file inside a with block

## note that the csv module provides a reader function

with open(path, mode='r', newline='') as file:
    i = 0
    reader = csv.reader(file)
    for row in reader:
        if i == 1:
            print(row[5])
            break
        i += 1

In [None]:
# better way - enumerate

path = 'data-text.csv'

with open(path, mode='r', newline='') as file:
    reader = csv.reader(file)
    for i, row in enumerate(reader):
        if i == 1:
            print(row[5])
            break


Note that we could read this file in a dictionary form also.

In [None]:
## we could also use the csv.DictReader class
country_list = []
with open(path) as f:
    reader = csv.DictReader(f)
    for row in reader:
        country_list.append(row['Country'])


In [None]:
len(set(country_list))
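The cells above only read data. For the writing direction, a minimal sketch with `csv.DictWriter` could look like the following. The records and the `Value` column are made up, and an in-memory buffer stands in for a file opened with `open('out.csv', 'w', newline='')`.

```python
import csv
import io

# Made-up records; the keys become the column names.
records = [
    {"Country": "Chad", "Value": "51"},
    {"Country": "Peru", "Value": "77"},
]

# An in-memory buffer stands in for a real output file.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["Country", "Value"])
writer.writeheader()
writer.writerows(records)

# Read the text back with DictReader to confirm the round trip.
buffer.seek(0)
round_trip = list(csv.DictReader(buffer))
print(round_trip)
```

The values are kept as strings here because the `csv` module always reads fields back as strings.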

In some cases using pandas for reading `.csv` files will be the most convenient.

In [None]:
import pandas as pd

In [None]:
## use pandas to read the same file
df = pd.read_csv(path)

df.head(4)
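The same function also handles tab-separated files through its `sep` parameter. The sketch below assumes pandas is installed and uses made-up text in place of a `.tsv` file on disk.

```python
import io

import pandas as pd

# Made-up tab-separated text standing in for a .tsv file on disk.
tsv_text = "Country\tValue\nChad\t51\nPeru\t77\n"

# sep='\t' tells read_csv to split columns on tabs instead of commas.
df_tsv = pd.read_csv(io.StringIO(tsv_text), sep="\t")
print(df_tsv)
```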

In fact, `pandas.read_csv` can also load data directly from the internet.

In [None]:
## pandas can read files from the web without downloading them first

link = 'https://raw.githubusercontent.com/jackiekazil/data-wrangling/master/data/chp3/data-text.csv'
df_from_link = pd.read_csv(link)

df_from_link.head(5)

## JSON files

In [None]:
import json

In [None]:
## in contrast to csv.reader, json.load parses the whole file in one call

with open('data-text.json') as file:
    json_file = json.load(file)
    for row in json_file:
        print(row)

In [None]:
json_file

Pandas can also read JSON files.

In [None]:
## not surprisingly pandas can read json too

pd.read_json('data-text.json')
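Writing JSON works much like reading it. The following sketch uses made-up records and a hypothetical file name, `records.json`, inside a temporary directory, and checks that the data survives a round trip.

```python
import json
import os
import tempfile

# Made-up records shaped like a list of dictionaries.
records = [{"Country": "Chad", "Value": 51}, {"Country": "Peru", "Value": 77}]

# Hypothetical output file inside a temporary directory.
out_path = os.path.join(tempfile.mkdtemp(), "records.json")

# json.dump writes to an open file; json.dumps would return a string instead.
with open(out_path, "w", encoding="utf-8") as f:
    json.dump(records, f, indent=2)

# json.load reads the file back into Python lists and dictionaries.
with open(out_path, encoding="utf-8") as f:
    loaded = json.load(f)

print(loaded == records)
```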

## Working with Zipped Files

In [4]:
from zipfile import ZipFile

__Download a zip file before continuing__

In [5]:
file_url = 'https://github.com/msaricaumbc/DS601_Fall21/raw/main/Week06/data/world_cup.csv.zip'
file_name = 'world_cup.zip'

download_file(file_name, file_url)

In [2]:
def unzip(file_name, path='./'):
    # open the zip file in read mode
    # (named zf so the built-in zip function is not shadowed)
    with ZipFile(file_name, 'r') as zf:
        # print a listing of the archive's contents
        zf.printdir()

        # extract all the files into `path`
        print('Extracting all the files now...')
        zf.extractall(path=path)
        print('Done!')

In [3]:
unzip(file_name)

In [None]:
!ls

In [None]:
pd.read_csv('world_cup.csv')

For more on extracting and writing zip files: https://www.geeksforgeeks.org/working-zip-files-python/
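Extraction to disk is not always necessary. As a sketch, the snippet below builds a small archive in memory and reads one member straight out of it with `ZipFile.open`. The member name `teams.csv` and its contents are made up and unrelated to the world cup data.

```python
import csv
import io
from zipfile import ZipFile

# Build a small zip archive in memory (made-up file name and contents).
archive = io.BytesIO()
with ZipFile(archive, "w") as zf:
    zf.writestr("teams.csv", "Team,Titles\nA,2\nB,1\n")

# Reopen the archive, list its members, and read one without extracting it.
archive.seek(0)
with ZipFile(archive, "r") as zf:
    names = zf.namelist()
    # zf.open yields bytes, so wrap it to decode text for csv.reader.
    with io.TextIOWrapper(
        zf.open("teams.csv"), encoding="utf-8", newline=""
    ) as text_stream:
        table = list(csv.reader(text_stream))

print(names)
print(table)
```

Reading members in place avoids leaving extracted files behind, which can be convenient when only one file from a large archive is needed.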

# Lab

Please complete your lab

# Homework

- You will be provided a zip file. Choose one of the csv files and perform the same work on it.