## Getting Data From Different Resources

There are several ways to get data into Python, depending on where the data is stored. The simplest case is data in CSV files, but you often need to read other formats and sources, such as text files, relational databases, websites, APIs and PDF documents.

Let us explore some methods to extract data from different sources.

### Text Files

Delimited files are usually text files in which columns are separated by a delimiter (such as a comma, tab or semicolon) and each line is a row. We will learn how to load and manipulate such data using Python.

We use the pandas function __`read_csv()`__ in this case, specifying the delimiter where required (the default is a comma). Let us see an example.

In [176]:
import pandas as pd

pd.read_csv('https://github.com/yashj1301/Python3-UpGrad-UMich/raw/master/Python%203.x/'+
            'Upgrad/Modules/Module%205%20-%20Data%20Extraction%20and%20Cleaning/Data/addresses.txt').head()

Unnamed: 0,777 Brockton Avenue,Abington MA 2351
0,30 Memorial Drive,Avon MA 2322
1,250 Hartford Avenue,Bellingham MA 2019
2,700 Oak Street,Brockton MA 2301
3,66-4 Parkhurst Rd,Chelmsford MA 1824
4,591 Memorial Dr,Chicopee MA 1020
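
In the output above, the first address appears to have been read as the header row, because `read_csv()` assumes by default that the first line holds column names. The following is a minimal sketch of how the delimiter and header handling can be set explicitly. The sample data, the semicolon delimiter and the column names are assumptions made for illustration and are not taken from `addresses.txt`.

```python
import io

import pandas as pd

# Hypothetical sample: semicolon-delimited address records with no header row
raw = (
    "777 Brockton Avenue;Abington;MA\n"
    "30 Memorial Drive;Avon;MA\n"
    "250 Hartford Avenue;Bellingham;MA\n"
)

# sep sets the delimiter; header=None tells pandas the first line is data;
# names supplies the column labels (illustrative names only)
df = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    header=None,
    names=["street", "city", "state"],
)

print(df.head())
```

With `header=None`, every line of the file is kept as a data row instead of the first one being used for column names.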


### MySQL Databases

Relational database management systems (RDBMS) are also commonly used to store data, and their contents are simple to import into Python. We’ll use __MySQL__, one of the most widely used.

Many libraries, such as __`pymysql`__ and __`MySQLdb`__, connect MySQL with Python. We will use the __`sqlalchemy`__ library here, with `pymysql` as the underlying driver. Working with MySQL through any of them broadly follows the procedure outlined below (SQLAlchemy manages the cursor for you):

- Create a connection object between MySQL and Python.
- Construct a cursor object (you use the cursor to run queries and fetch their results).
- Run the SQL query.
- Get the query’s results using methods such as `fetchone()` and `fetchall()`.
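
The four steps can be sketched with Python's built-in `sqlite3` module, which follows the same connection and cursor pattern as the MySQL drivers. This is only a stand-in so that the example runs without a database server; the table name and rows are hypothetical.

```python
import sqlite3

# 1. Create a connection object (an in-memory SQLite database, used here
#    only as a stand-in for a MySQL server)
conn = sqlite3.connect(":memory:")

# 2. Construct a cursor object from the connection
cur = conn.cursor()

# 3. Run SQL queries through the cursor (hypothetical table and rows)
cur.execute("CREATE TABLE cities (name TEXT, state TEXT)")
cur.executemany(
    "INSERT INTO cities VALUES (?, ?)",
    [("Abington", "MA"), ("Avon", "MA"), ("Brockton", "MA")],
)
cur.execute("SELECT name, state FROM cities ORDER BY name")

# 4. Fetch the results: fetchone() returns the next row,
#    fetchall() returns all rows that remain
first_row = cur.fetchone()
remaining_rows = cur.fetchall()
print(first_row)
print(remaining_rows)

# Close the cursor and the connection when done
cur.close()
conn.close()
```

Note that `fetchall()` only returns the rows that have not been fetched yet, so here it skips the row already returned by `fetchone()`.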

Let us see it in action. 

#### Importing necessary libraries

In [None]:
import sys # for locating the current Python interpreter

# install the dependencies before importing them
!{sys.executable} -m pip install "cloud-sql-python-connector[pymysql]" SQLAlchemy

from google.colab import auth # for authenticating with Google Cloud
from google.cloud.sql.connector import Connector # for creating a connector object
import sqlalchemy # for creating the connection pool

auth.authenticate_user() # authenticate the Google Cloud user


#### Creating a connection

In [3]:
# initialize parameters
INSTANCE_CONNECTION_NAME = "upgrad-learning:asia-south2:upgrad-mysql" # format project:region:instance, e.g. demo-project:us-central1:demo-instance
print(f"Your instance connection name is: {INSTANCE_CONNECTION_NAME}")
DB_USER = "root"
DB_PASS = "root"
DB_NAME = "mysql"

Your instance connection name is: upgrad-learning:asia-south2:upgrad-mysql


In [4]:
# initialize Connector object
connector = Connector()

# function to return the database connection object
def getconn():
    conn = connector.connect(
        INSTANCE_CONNECTION_NAME,
        "pymysql",
        user=DB_USER,
        password=DB_PASS,
        db=DB_NAME
    )
    return conn

# create connection pool with 'creator' argument to our connection object function
pool = sqlalchemy.create_engine(
    "mysql+pymysql://",
    creator=getconn,
)

#### Running MySQL Queries

In [5]:
with pool.connect() as db_conn:
  # showing all the tables in our database 'mysql'
  # sqlalchemy.text() wraps the raw SQL string, as required by SQLAlchemy 2.x
  for i in db_conn.execute(sqlalchemy.text("SHOW TABLES")).fetchall():
    print(i)


('audit_log_rules',)
('audit_log_rules_expanded',)
('audit_log_supported_ops',)
('cloudsql_replica_index',)
('columns_priv',)
('component',)
('db',)
('default_roles',)
('engine_cost',)
('func',)
('general_log',)
('global_grants',)
('gtid_executed',)
('heartbeat',)
('help_category',)
('help_keyword',)
('help_relation',)
('help_topic',)
('innodb_index_stats',)
('innodb_table_stats',)
('password_history',)
('plugin',)
('procs_priv',)
('proxies_priv',)
('replication_asynchronous_connection_failover',)
('replication_asynchronous_connection_failover_managed',)
('replication_group_configuration_version',)
('replication_group_member_actions',)
('role_edges',)
('server_cost',)
('servers',)
('slave_master_info',)
('slave_relay_log_info',)
('slave_worker_info',)
('slow_log',)
('tables_priv',)
('time_zone',)
('time_zone_leap_second',)
('time_zone_name',)
('time_zone_transition',)
('time_zone_transition_type',)
('user',)


In [6]:
with pool.connect() as db_conn:
  # fetchone() returns only the first row; iterating over it yields its column values
  for i in db_conn.execute(sqlalchemy.text("SHOW TABLES")).fetchone():
    print(i)


audit_log_rules


In [11]:
with pool.connect() as db_conn:
  # describing the columns of the 'user' table in our database 'mysql'
  for i in db_conn.execute(sqlalchemy.text("desc user")).fetchall():
    print(i)


('Host', 'char(255)', 'NO', 'PRI', '', '')
('User', 'char(32)', 'NO', 'PRI', '', '')
('Select_priv', "enum('N','Y')", 'NO', '', 'N', '')
('Insert_priv', "enum('N','Y')", 'NO', '', 'N', '')
('Update_priv', "enum('N','Y')", 'NO', '', 'N', '')
('Delete_priv', "enum('N','Y')", 'NO', '', 'N', '')
('Create_priv', "enum('N','Y')", 'NO', '', 'N', '')
('Drop_priv', "enum('N','Y')", 'NO', '', 'N', '')
('Reload_priv', "enum('N','Y')", 'NO', '', 'N', '')
('Shutdown_priv', "enum('N','Y')", 'NO', '', 'N', '')
('Process_priv', "enum('N','Y')", 'NO', '', 'N', '')
('File_priv', "enum('N','Y')", 'NO', '', 'N', '')
('Grant_priv', "enum('N','Y')", 'NO', '', 'N', '')
('References_priv', "enum('N','Y')", 'NO', '', 'N', '')
('Index_priv', "enum('N','Y')", 'NO', '', 'N', '')
('Alter_priv', "enum('N','Y')", 'NO', '', 'N', '')
('Show_db_priv', "enum('N','Y')", 'NO', '', 'N', '')
('Super_priv', "enum('N','Y')", 'NO', '', 'N', '')
('Create_tmp_table_priv', "enum('N','Y')", 'NO', '', 'N', '')
('Lock_tables_priv', 
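
Printing rows is useful for inspection, but for analysis it is often more convenient to load a query result into a DataFrame. The sketch below uses the pandas function `read_sql_query()` with an in-memory SQLite database as a stand-in for MySQL. The table and its rows are hypothetical and only illustrate the idea.

```python
import sqlite3

import pandas as pd

# In-memory SQLite database used as a stand-in for a MySQL server
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, for illustration only
conn.execute("CREATE TABLE privileges (username TEXT, select_priv TEXT)")
conn.executemany(
    "INSERT INTO privileges VALUES (?, ?)",
    [("root", "Y"), ("guest", "N")],
)
conn.commit()

# Run the query and load the result set into a DataFrame
df = pd.read_sql_query(
    "SELECT username, select_priv FROM privileges ORDER BY username", conn
)
print(df)

conn.close()
```

The column names of the DataFrame come from the query itself, so no manual labelling of the result is needed.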

### Websites (Web Scraping)

__Web scraping__ is the process of extracting data from web pages programmatically. Python makes scraping websites simple, thanks to libraries that fetch and parse HTML.

__`BeautifulSoup`__ is one of the most popular web scraping packages for Python 3. Alongside it we need the `requests` module, which connects to a URL and retrieves its content (here, in HTML format). BeautifulSoup’s main purpose is to let you quickly parse the HTML code that a web page is made up of.

Once you have a bs4 object, you can extract specific parts of the HTML document by passing a CSS selector to the `soup.select()` method. Let us see it in action.

In [177]:
import requests, bs4

# getting HTML from the Google Play web page
url = "https://play.google.com/store/apps/details?id=com.facebook.orca&hl=en"
req = requests.get(url)

# create a bs4 object
# To avoid warnings, provide "html5lib" explicitly
soup = bs4.BeautifulSoup(req.text, "html5lib")

# getting all the "div" elements on the page
reviews = soup.select('div')
print(type(reviews))
print(len(reviews))


<class 'list'>
434
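
Selecting every `div` returns a large, unfocused list. As a minimal sketch of more targeted selection, the example below parses a small hypothetical HTML fragment (not the Google Play page) and uses CSS selectors with class names. The class names `review` and `author` are invented for illustration.

```python
import bs4

# Hypothetical HTML fragment standing in for a downloaded page
html = """
<html><body>
  <div class="review"><span class="author">Asha</span><p>Works well.</p></div>
  <div class="review"><span class="author">Ben</span><p>Crashes sometimes.</p></div>
  <div class="footer">Contact us</div>
</body></html>
"""

# "html.parser" ships with Python, so no extra parser needs to be installed
soup = bs4.BeautifulSoup(html, "html.parser")

# CSS selectors: a tag name, tag.class, and descendant combinations
all_divs = soup.select("div")
review_texts = [p.get_text() for p in soup.select("div.review p")]
authors = [s.get_text() for s in soup.select("div.review span.author")]

print(len(all_divs))
print(review_texts)
print(authors)
```

On a real page, the right selectors have to be found by inspecting the page's HTML, and they may change whenever the site is redesigned.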


### Using APIs

APIs, or __Application Programming Interfaces__, are created by companies and organisations to provide controlled access to data. It is very common to get data from APIs for data analysis; for example, you can get financial data (stock prices, etc.), social media data (Facebook, Twitter, etc.), weather data, and data about healthcare, music, food and drinks, and almost every other domain.

Apart from being rich sources of data, there are other reasons to use APIs:

- We use APIs when the data needs to be updated in real time. By contrast, if you use downloaded CSV files, you have to download the data manually every time it changes and rerun your analysis.
- Through APIs, you can automate the process of getting real-time data.
- APIs also provide easy access to structured and verified data. Though you can scrape websites, APIs can offer data directly in a structured format with better quality.
- APIs provide access to restricted data. Many websites are hard to scrape, and scraping them can violate their terms of service or the law (for example, Facebook or financial data providers). For such data, an API is often the only legitimate route.

We use the `requests` and `json` libraries to extract data from APIs. Let us see them in action.

In [16]:
import numpy as np
import pandas as pd

# Need requests to connect to the URL, json to convert JSON to dict
import requests, json
import pprint

# Convert an address in standard human-readable form
# into a URL-friendly string by replacing spaces with "+"

api_key = "AIzaSyBXrK8md7uaOcpRpaluEGZAtdXS4pcI5xo"
add = "UpGrad, Nishuvi building, Anne Besant Road, Worli, Mumbai"
split_address = add.split(" ")
address = "+".join(split_address)
print(address)

UpGrad,+Nishuvi+building,+Anne+Besant+Road,+Worli,+Mumbai
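
Joining words with `+` handles spaces but leaves other special characters, such as commas, unescaped. As an alternative sketch, the standard library function `urllib.parse.urlencode()` escapes every query parameter. The key below is a placeholder, not a working API key.

```python
from urllib.parse import urlencode

base_url = "https://maps.googleapis.com/maps/api/geocode/json"

params = {
    "address": "UpGrad, Nishuvi building, Anne Besant Road, Worli, Mumbai",
    "key": "YOUR_API_KEY",  # placeholder, not a real key
}

# urlencode() turns spaces into "+" and escapes characters such as ","
query = urlencode(params)
url = f"{base_url}?{query}"
print(url)
```

The `requests` library can do the same encoding itself if the dictionary is passed as `requests.get(base_url, params=params)`.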


In [17]:
api_key = "AIzaSyBXrK8md7uaOcpRpaluEGZAtdXS4pcI5xo"

url = "https://maps.googleapis.com/maps/api/geocode/json?address={0}&key={1}".format(address, api_key)
r = requests.get(url)

# The r.text attribute contains the text in the response object
print(type(r.text))
print(r.text)

<class 'str'>
{
   "error_message" : "You must enable Billing on the Google Cloud Project at https://console.cloud.google.com/project/_/billing/enable Learn more at https://developers.google.com/maps/gmp-get-started",
   "results" : [],
   "status" : "REQUEST_DENIED"
}
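
The response text is a JSON string, so the `json` library can convert it into a Python dictionary. The sketch below shows one way to pull the latitude and longitude out of a successful response and to handle a denied request like the one above. The sample strings and the helper name `extract_lat_lng` are hypothetical; the coordinates are made-up values, and the nesting `results[0]["geometry"]["location"]` is assumed to follow the Geocoding API's JSON layout.

```python
import json

# Hypothetical, abridged response for a successful request (made-up values)
sample_ok = """
{
  "results": [
    {"formatted_address": "Example Address, Mumbai",
     "geometry": {"location": {"lat": 19.0, "lng": 72.8}}}
  ],
  "status": "OK"
}
"""

# Abridged version of a denied request
sample_denied = '{"error_message": "Billing not enabled", "results": [], "status": "REQUEST_DENIED"}'


def extract_lat_lng(response_text):
    """Return (lat, lng) from a geocoding response, or None if there is no result."""
    data = json.loads(response_text)  # JSON string -> Python dict
    if data.get("status") != "OK" or not data.get("results"):
        return None
    location = data["results"][0]["geometry"]["location"]
    return (location["lat"], location["lng"])


print(extract_lat_lng(sample_ok))
print(extract_lat_lng(sample_denied))
```

Checking the `status` field before indexing into `results` avoids an `IndexError` when the request is rejected.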



### Data from PDFs

Reading PDF files is not as straightforward as reading text or delimited files using Python since PDFs often contain images, tables, etc. PDFs are mainly designed to be human-readable, and thus you need special libraries to read them in Python (or any other programming language).

Luckily, there are some great libraries in Python. We will use __`PyPDF2`__ to read PDFs since it is easy to use and works with most PDFs. Note that this approach will __only read text from PDFs, not images, tables etc.__ (though extracting those is possible using other specialised libraries).

Let us see it in action.

In [178]:
!pip install PyPDF2
import PyPDF2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [180]:
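# PdfReader, .pages and extract_text() replace the older PdfFileReader,
# getPage() and extractText(), which were removed in PyPDF2 3.0
with open('/content/sample_data/animal_farm.pdf', 'rb') as obj:
    reader = PyPDF2.PdfReader(obj)
    page = reader.pages[0]
    text = page.extract_text()

text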



'Animal Farm\nGeorge Orwell\n1945'