# Next-Level Data Science
## Web Data Extraction & Analysis with Pandas

## Today

- Why this workshop?
- About the Speaker
- Recap of `Python + Pandas`
- Snippets Download
- XML Data
- Web Page
- Feedback & Solutions
- API Data
- Q & A


# Recap of Python + Pandas

# Pandas Data Structures

- Series

- DataFrame

# Creating Pandas Data Structures

## Series

- list
- tuple
- dictionary
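
The cells that follow build a Series from a list and from a dictionary. For the tuple case, a minimal sketch might look like this; the `odd` values and the Series name are illustrative and not part of the workshop data.

```python
import pandas as pd

# A tuple behaves like a list here: values keep their order
# and receive a default integer index starting at 0
odd = (1, 3, 5, 7, 9)
odd_series = pd.Series(odd, name="odd")

print(odd_series)
```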

In [None]:
import pandas as pd

In [None]:
even = [2, 4, 6, 8, 10, 12]

series_data = pd.Series(even)
series_data

In [None]:
series_data.name = "even"

In [None]:
# Call the method; without parentheses only the bound method object is displayed
series_data.info()

In [None]:
student = {'name':"Kishore", 'age':20, 'place':"Bengaluru", 'email':"kishore@email.com"}
student_data = pd.Series(student)
student_data

In [None]:
# Call the method; without parentheses only the bound method object is displayed
student_data.info()

# DataFrame

In [None]:
student = {
    "name": ['Kishore', 'Vinay', 'Adithya', 'Kumar', 'Vijaya'],
    "place": ['Bengaluru', 'Mysore', 'Bengaluru', 'Tumakuru', 'Mysore'],
    "age": [21, 23, 20, 21, 19]
}

student_df = pd.DataFrame(student)
student_df

In [None]:
student_df.shape

In [None]:
student_df.index

In [None]:
student_df.columns

In [None]:
# Attribute access returns the "name" column, same as student_df["name"]
student_df.name
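
As a quick check of how these attributes relate, the sketch below rebuilds a smaller frame and confirms that `shape` is the length of the index by the length of the columns. The `demo_df` name and its three rows are assumptions made for illustration.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "name": ["Kishore", "Vinay", "Adithya"],
    "place": ["Bengaluru", "Mysore", "Bengaluru"],
    "age": [21, 23, 20],
})

rows, cols = demo_df.shape

# shape is (number of index labels, number of column labels)
shape_matches = (rows, cols) == (len(demo_df.index), len(demo_df.columns))

# dtypes lists the type pandas inferred for each column
print(demo_df.dtypes)
```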

# XML Data

## Simple XML Data


```xml
<?xml version="1.0" encoding="UTF-8"?>
<catalog>
    <board>
        <name>Raspberry Pi Pico W</name>
        <controller>RP2040</controller>
        <cost>750</cost>
    </board>
    <board>
        <name>Raspberry Pi Pico 2W</name>
        <controller>RP2350</controller>
        <cost>950</cost>
    </board>
    <board>
        <name>Arduino Nano</name>
        <controller>Atmega328P</controller>
        <cost>600</cost>
    </board>
    <board>
        <name>Arduino Uno</name>
        <controller>Atmega328P</controller>
        <cost>1200</cost>
    </board>
    <board>
        <name>NodeMCU ESP8266</name>
        <controller>ESP8266</controller>
        <cost>550</cost>
    </board>
</catalog>

```

In [None]:
xml_data = '''<?xml version="1.0" encoding="UTF-8"?>
<catalog>
    <board>
        <name>Raspberry Pi Pico W</name>
        <controller>RP2040</controller>
        <cost>750</cost>
    </board>
    <board>
        <name>Raspberry Pi Pico 2W</name>
        <controller>RP2350</controller>
        <cost>950</cost>
    </board>
    <board>
        <name>Arduino Nano</name>
        <controller>Atmega328P</controller>
        <cost>600</cost>
    </board>
    <board>
        <name>Arduino Uno</name>
        <controller>Atmega328P</controller>
        <cost>1200</cost>
    </board>
    <board>
        <name>NodeMCU ESP8266</name>
        <controller>ESP8266</controller>
        <cost>550</cost>
    </board>
</catalog>
'''

In [None]:
import pandas as pd
from io import StringIO

xml_pd = pd.read_xml(StringIO(xml_data))

In [None]:
xml_pd
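
Once the XML is in a DataFrame, the usual pandas operations apply. The sketch below repeats the idea on a shortened three-board catalog and passes `parser="etree"`, which makes `read_xml` use Python's built-in ElementTree rather than the default `lxml`. The names `small_xml`, `boards`, `cheapest_first`, and `atmega_boards` are introduced here for illustration.

```python
import pandas as pd
from io import StringIO

# Shortened copy of the catalog, kept small for illustration
small_xml = """<catalog>
    <board>
        <name>Raspberry Pi Pico W</name>
        <controller>RP2040</controller>
        <cost>750</cost>
    </board>
    <board>
        <name>Arduino Nano</name>
        <controller>Atmega328P</controller>
        <cost>600</cost>
    </board>
    <board>
        <name>Arduino Uno</name>
        <controller>Atmega328P</controller>
        <cost>1200</cost>
    </board>
</catalog>"""

# parser="etree" relies on the standard library instead of lxml
boards = pd.read_xml(StringIO(small_xml), parser="etree")

# Each <board> becomes a row and each child tag becomes a column
cheapest_first = boards.sort_values("cost").reset_index(drop=True)
atmega_boards = boards[boards["controller"] == "Atmega328P"]

print(cheapest_first)
```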

# SEO

## Infosys Sitemap XML

URL: `https://www.infosys.com/sitemap.xml`

In [None]:
infosys_url = "https://www.infosys.com/sitemap.xml"

infosys_pd = pd.read_xml(infosys_url)

In [None]:
infosys_pd

In [None]:
infosys_pd.shape

In [None]:
infosys_pd['loc'].nunique()

In [None]:
infosys_pd.changefreq.value_counts()

# Hourly Check?

In [None]:
infosys_pd[infosys_pd.changefreq == 'hourly']

In [None]:
infosys_pd[infosys_pd.changefreq == 'hourly'].iloc[0]['loc']

In [None]:
# Boolean mask: True for URLs whose changefreq is 'daily'

infosys_pd.changefreq == 'daily'

In [None]:
infosys_pd[infosys_pd.changefreq == 'daily']

In [None]:
infosys_pd[(infosys_pd['changefreq'] == 'daily') & (infosys_pd['priority'] == 1.0)]
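
The filters above all follow one pattern: a comparison produces a boolean Series, and indexing with it keeps the rows marked `True`. Here is a minimal offline sketch of that pattern. The `sitemap` frame and its `example.com` URLs are hypothetical stand-ins, since a real sitemap may not carry every column.

```python
import pandas as pd

# Hypothetical sitemap rows, not taken from any real site
sitemap = pd.DataFrame({
    "loc": [
        "https://example.com/",
        "https://example.com/news",
        "https://example.com/about",
        "https://example.com/careers",
    ],
    "changefreq": ["daily", "hourly", "monthly", "daily"],
    "priority": [1.0, 0.8, 0.5, 0.6],
})

# Each comparison yields a boolean Series; & keeps rows where both are True
is_daily = sitemap["changefreq"] == "daily"
is_top = sitemap["priority"] == 1.0
daily_top = sitemap[is_daily & is_top]

# isin() matches several values in one step
frequent = sitemap[sitemap["changefreq"].isin(["hourly", "daily"])]
```

Naming the masks (`is_daily`, `is_top`) keeps longer conditions readable and avoids repeating the comparison.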

# HTML

## FDIC

### Federal Deposit Insurance Corporation

[Failed Bank List](https://www.fdic.gov/bank-failures/failed-bank-list)

In [None]:
bank_html = pd.read_html("https://www.fdic.gov/bank-failures/failed-bank-list")

In [None]:
bank_html

In [None]:
type(bank_html)

In [None]:
len(bank_html)

In [None]:
# read_html already returns DataFrames, so the first table can be used directly
bank_pd = bank_html[0]

In [None]:
bank_pd

## Is it only 24 Rows?

[Failed Bank List](https://www.fdic.gov/bank-failures/failed-bank-list)

In [None]:
url = "https://www.fdic.gov/bank-failures/failed-bank-list?pg="
contents = []

# Fetch pages 1 to 23; each call returns a list of the tables found on that page
for i in range(1, 24):
    url_p = url + str(i)
    contents.append(pd.read_html(url_p))

len(contents)

In [None]:
datas = []

# The first table on each page is already a DataFrame
for content in contents:
    datas.append(content[0])

all_bank_pd = pd.concat(datas)

In [None]:
all_bank_pd

## Reindex

In [None]:
all_bank_pd = all_bank_pd.reset_index(drop=True)
all_bank_pd
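
The concat-then-reset pattern above can also be done in one step. Here is a minimal sketch with two made-up page tables; `page1`, `page2`, and their rows are placeholders, not FDIC data.

```python
import pandas as pd

page1 = pd.DataFrame({"bank": ["Bank A", "Bank B"], "state": ["TX", "CA"]})
page2 = pd.DataFrame({"bank": ["Bank C", "Bank D"], "state": ["NY", "FL"]})

# Plain concat keeps each page's own index, so the labels repeat
stacked = pd.concat([page1, page2])

# ignore_index=True renumbers the rows from 0, like reset_index(drop=True) afterwards
renumbered = pd.concat([page1, page2], ignore_index=True)

print(stacked.index.tolist())
print(renumbered.index.tolist())
```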

# Feedback & Solutions

## Requests

- Video Recording
- Session on Weekdays
- Complete Series in a stretch

## Issues faced

- Time and Effort
- Number of participants
- No feedback

## Solutions

- Live Sessions on YouTube
- Introductory: 14-day IoT Internship
- Feedback > Certificate 

# API

## Application Programming Interface

# JSON
## JavaScript Object Notation

In [None]:
todo_url = "https://jsonplaceholder.typicode.com/todos/"
todo_pd = pd.read_json(todo_url)

In [None]:
todo_pd

In [None]:
dummy_user_url = "https://dummyjson.com/users"
users = pd.read_json(dummy_user_url)

In [None]:
users

In [None]:
users_df = pd.json_normalize(users['users'])

In [None]:
users_df
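
`pd.json_normalize` is what turns the nested records into flat columns. The sketch below shows the effect on a tiny hand-written payload shaped like a `{"users": [...]}` response. The field names and values are assumptions for illustration and are not copied from the dummyjson API.

```python
import pandas as pd

# Hypothetical payload; field names and values are made up
payload = {
    "users": [
        {"id": 1, "firstName": "Asha",
         "address": {"city": "Mysore", "state": "Karnataka"}},
        {"id": 2, "firstName": "Ravi",
         "address": {"city": "Bengaluru", "state": "Karnataka"}},
    ],
    "total": 2,
}

# Nested dictionaries are flattened into dotted column names such as "address.city"
flat = pd.json_normalize(payload["users"])

print(flat.columns.tolist())
```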

In [None]:
users_df.gender.value_counts()

# NASA

## Meteorite Landing

[link](https://data.nasa.gov/docs/legacy/meteorite_landings/gh4g-9sfh.json)

In [None]:
url = "https://data.nasa.gov/docs/legacy/meteorite_landings/gh4g-9sfh.json"
nasa_df = pd.read_json(url)

In [None]:
nasa_df
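
Open-data JSON exports often deliver numbers and dates as strings, so a type-conversion step is a common follow-up. The sketch below assumes that is the case here. The column names `mass` and `year` and all three records are placeholders, not values from the NASA file.

```python
import pandas as pd

# Placeholder records, not real meteorite data
raw = pd.DataFrame({
    "name": ["Sample A", "Sample B", "Sample C"],
    "mass": ["21", "720.5", None],
    "year": [
        "1880-01-01T00:00:00.000",
        "1951-01-01T00:00:00.000",
        "1994-01-01T00:00:00.000",
    ],
})

# errors="coerce" turns values that cannot be parsed into NaN/NaT instead of raising
raw["mass"] = pd.to_numeric(raw["mass"], errors="coerce")
raw["year"] = pd.to_datetime(raw["year"], errors="coerce").dt.year

# With numeric mass, ordinary comparisons work; rows with missing mass drop out
heavy = raw[raw["mass"] > 100]
```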