# Introduction
One of the most significant hurdles in real-world ML projects is the scarcity of high-quality data. Data collection is itself a substantial undertaking, often requiring significant time and resources. Moreover, the collected data rarely arrives in a pristine format ready for immediate model training. Extensive data cleaning and preprocessing are essential to address issues like,
- Missing values: Handling missing data points accurately is crucial to prevent biased or inaccurate model predictions.
- Inconsistent formatting: Data can be inconsistent in terms of units, capitalization and other formatting aspects, requiring careful cleaning and standardization.
- Outliers: Extreme values can significantly impact model performance and require appropriate handling (removal, transformation, etc.).
- Data imbalances: Class imbalances, where one class has significantly fewer samples than the others, can lead to biased models.
- Noise: Data can be contaminated with noise, which can degrade model performance.
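
As a rough illustration of three of the issues above (inconsistent formatting, outliers and missing values), here is a minimal sketch using only the standard library. The records, the plausible temperature range and the choice of median imputation are all assumptions made for the example, not a recommended recipe.

```python
from statistics import median

# Hypothetical raw records: inconsistent capitalization, a missing value and an outlier.
raw = [
    {"city": "Pune", "temp_c": 31.0},
    {"city": "pune ", "temp_c": None},
    {"city": "PUNE", "temp_c": 29.0},
    {"city": "Pune", "temp_c": 300.0},  # implausible reading
]

# Inconsistent formatting: trim whitespace and normalize capitalization.
for row in raw:
    row["city"] = row["city"].strip().title()

# Outliers: treat readings outside an assumed plausible range as missing.
for row in raw:
    if row["temp_c"] is not None and not (-60.0 <= row["temp_c"] <= 60.0):
        row["temp_c"] = None

# Missing values: impute with the median of the remaining readings.
known = [row["temp_c"] for row in raw if row["temp_c"] is not None]
fill = median(known)
cleaned = [
    {**row, "temp_c": fill if row["temp_c"] is None else row["temp_c"]}
    for row in raw
]
print(cleaned)
```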

Some of the common sources of data include,
- Flat files: These are simple files such as `.csv`, `.txt` and `.dat` (plain text) or `.xlsx` (a spreadsheet format), often used for storing tabular data.
- DBMS: Relational databases (SQL) and NoSQL databases provide structured and unstructured data storage and retrieval mechanisms.
- Web APIs: APIs allow programmatic access to data from various sources, such as social media platforms, weather services and financial markets.

# What Are APIs?
API stands for Application Programming Interface. It is like a set of rules or a contract that defines how different software applications can talk to each other and share data.

Think of it as a waiter at a restaurant. The customer (the application) tells the waiter (the API) what they want (data or functionality). The waiter communicates the customer's request to the kitchen (the other application) and then brings the food (the response) back to the customer.

### How do APIs work?
1. Request: An application sends a request to an API, specifying what it wants (e.g., weather data for a specific location).
2. Processing: The API receives the request, processes it and fetches the necessary data from the relevant system.
3. Response: The API sends a response back to the original application, containing the requested data or information.
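
The three steps above can be sketched with an ordinary Python function standing in for the API. The toy weather data, the function name `weather_api` and the shape of the request and response dictionaries are all hypothetical; the point is only that the caller relies on the contract, not on how the data is stored.

```python
# A toy "weather service": the data and all names here are hypothetical.
_WEATHER_DB = {"bengaluru": {"temp_c": 24, "condition": "cloudy"}}


def weather_api(request: dict) -> dict:
    """Process a request and return a response that follows a fixed contract."""
    # Processing: validate the request and look up the data.
    location = str(request.get("location", "")).lower()
    if location not in _WEATHER_DB:
        return {"status": "error", "message": f"unknown location: {location!r}"}
    # Response: always the same shape, so callers know what to expect.
    return {"status": "ok", "data": _WEATHER_DB[location]}


# Request: the calling application only needs to know the contract.
print(weather_api({"location": "Bengaluru"}))
print(weather_api({"location": "Atlantis"}))
```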

### Why are APIs important?
- Data sharing: APIs enable seamless data exchange between different applications, allowing businesses to integrate their systems and improve efficiency.
- Innovation: APIs foster innovation by allowing developers to build new applications and services on top of existing platforms, creating a more connected ecosystem.
- Automation: APIs automate tasks, reducing manual effort and increasing productivity.
- Improved user experience: APIs can personalize user experiences by integrating data from various sources, providing tailored recommendations and services.

### Examples of APIs in action
- Social media logins: APIs allow users to log in to websites using their Facebook or Google accounts.
- Mapping services: APIs like Google Maps provide location data and navigation instructions to various applications.
- E-commerce platforms: APIs enable the integration of payment gateways, shipping services and inventory management systems.
- Weather apps: APIs fetch real-time weather data from weather services and display it in a user-friendly format.

# Web APIs
Web APIs are a specific type of API that uses web technologies like HTTP to communicate over the internet. They act as intermediaries, allowing different software systems to exchange data and functionality through web protocols.

### Key characteristics
- HTTP-based: Web APIs primarily use the Hypertext Transfer Protocol (HTTP) for communication. This ensures compatibility across various platforms and programming languages.
- Data formats: Common data formats for exchanging information include,
    - JSON (JavaScript Object Notation): A lightweight and human-readable format for representing structured data.
    - XML (eXtensible Markup Language): A more verbose format for describing data using tags.
- Endpoints: Web APIs expose specific URLs (endpoints) that clients can use to interact with the API. These endpoints define the type of request (e.g., `GET`, `POST`, `PUT`, `DELETE`) and the expected data format.
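
To make the difference between the two data formats concrete, the sketch below parses the same record from JSON and from XML using the standard library. The record itself and the XML layout are made up for illustration; real APIs define their own schemas.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record serialized in both formats.
json_payload = '{"id": 123, "title": "Hello, APIs", "tags": ["intro", "web"]}'
xml_payload = """
<post id="123">
  <title>Hello, APIs</title>
  <tags><tag>intro</tag><tag>web</tag></tags>
</post>
"""

# JSON maps directly onto Python dicts, lists, strings and numbers.
post_from_json = json.loads(json_payload)

# XML is parsed into an element tree; values arrive as strings and need converting.
root = ET.fromstring(xml_payload.strip())
post_from_xml = {
    "id": int(root.attrib["id"]),
    "title": root.findtext("title"),
    "tags": [tag.text for tag in root.iter("tag")],
}

print(post_from_json == post_from_xml)
```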

### How do web APIs work?
1. Request: A client application (like a web browser or mobile app) sends a request to a specific URL (endpoint) of the web API.
2. Processing: The web API server receives the request, processes it (e.g., fetches data from a database, performs calculations) and prepares a response.
3. Response: The server sends a response back to the client application, containing the requested data or information in the chosen format.
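
The request, processing and response cycle can be observed end to end without leaving the machine. The following sketch starts a tiny web API on localhost with the standard library and calls it with `urllib`. The `/greeting` endpoint and its payload are invented for the demo, and a production API would use a proper web framework instead of `http.server`.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, build_opener


class GreetingHandler(BaseHTTPRequestHandler):
    """Hypothetical API with one endpoint: GET /greeting."""

    def do_GET(self):
        # Processing: decide on a status code and build the JSON body.
        if self.path == "/greeting":
            status, payload = 200, {"message": "hello"}
        else:
            status, payload = 404, {"message": "not found"}
        body = json.dumps(payload).encode("utf-8")
        # Response: status line, headers, then the body.
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), GreetingHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Request: the client calls the endpoint (ignoring any system proxy for this local demo).
opener = build_opener(ProxyHandler({}))
with opener.open(f"http://127.0.0.1:{port}/greeting", timeout=5) as resp:
    status = resp.status
    content_type = resp.headers["Content-Type"]
    payload = json.loads(resp.read().decode("utf-8"))

server.shutdown()
server.server_close()
print(status, content_type, payload)
```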

### Why are web APIs important?
- Powering modern applications: They are the foundation of many modern web and mobile applications, enabling features like,
    - Social logins: Using a Google or Facebook account to log in to other sites.
    - Maps and navigation: Integrating maps and location services into apps.
    - E-commerce: Handling payments, shipping and inventory.
    - Streaming services: Providing video and audio content.
- Microservices architecture: Web APIs are crucial for building microservices-based systems, where applications are broken down into smaller, independent services.
- Data sharing and integration: They facilitate seamless data exchange between different systems and organizations.

# What are REST APIs?
REST stands for Representational State Transfer. REST APIs are web APIs that adhere to the principles of the REST architectural style. They have become incredibly popular for building modern web and mobile applications.

### Key characteristics of REST APIs
- Resource oriented: Everything in a REST API is treated as a "resource" (e.g., a user, a product, a blog post).
- HTTP methods: Utilize standard HTTP methods like,
    - `GET`: Retrieve a resource.
    - `POST`: Create a new resource.
    - `PUT`: Update an existing resource.
    - `DELETE`: Remove a resource.
- Stateless: Each request from a client is independent. The server doesn't maintain any session information between requests.
- Client-server architecture: Clear separation between the client (making the request) and the server (providing the resource).
- Uniform interface: Uses a consistent set of rules and conventions for how clients interact with the API.

### Why are REST APIs so popular?
- Simplicity: Relatively easy to understand and implement compared to other API styles.
- Flexibility: Can be used with various programming languages and platforms.
- Scalability: Because each request is stateless, any server instance can handle any request, which makes it easier to scale out under heavy load.
- Performance: Responses can often be cached, which reduces server load and response times.

### Example
Consider a blog. A REST API for the blog might have endpoints like,
- `GET /posts`: Retrieve a list of all blog posts.
- `GET /posts/123`: Retrieve a specific blog post with the ID 123.
- `POST /posts`: Create a new blog post.
- `PUT /posts/123`: Update the blog post with ID 123.
- `DELETE /posts/123`: Delete the blog post with ID 123.
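
How these endpoints map HTTP methods onto a resource can be sketched with an in-memory dispatcher. This is not a web server and not any particular framework's API: the `handle` function, the storage dictionary and the chosen status codes are assumptions that follow common REST conventions.

```python
from itertools import count

posts = {}           # in-memory stand-in for a database: post ID -> post
_next_id = count(1)  # hypothetical ID generator


def handle(method, path, body=None):
    """Return (status_code, payload) for a request to the blog API."""
    parts = [part for part in path.split("/") if part]
    if not parts or parts[0] != "posts" or len(parts) > 2:
        return 404, {"message": "not found"}

    if len(parts) == 1:  # the collection: /posts
        if method == "GET":
            return 200, list(posts.values())
        if method == "POST":
            post = {**(body or {}), "id": next(_next_id)}
            posts[post["id"]] = post
            return 201, post
        return 405, {"message": "method not allowed"}

    # A single resource: /posts/<id>
    if not parts[1].isdecimal() or int(parts[1]) not in posts:
        return 404, {"message": "not found"}
    post_id = int(parts[1])
    if method == "GET":
        return 200, posts[post_id]
    if method == "PUT":
        posts[post_id] = {**(body or {}), "id": post_id}
        return 200, posts[post_id]
    if method == "DELETE":
        del posts[post_id]
        return 204, None
    return 405, {"message": "method not allowed"}


created = handle("POST", "/posts", {"title": "First post"})
fetched = handle("GET", "/posts/1")
updated = handle("PUT", "/posts/1", {"title": "Edited post"})
deleted = handle("DELETE", "/posts/1")
missing = handle("GET", "/posts/1")
for result in (created, fetched, updated, deleted, missing):
    print(result)
```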

# `requests` Package
The `requests` package in Python is a powerful tool for making HTTP requests. It simplifies the process of interacting with web servers, making it easier to fetch data from APIs, send data to web applications and perform other web-related tasks.

### Key features
- Easy to use: The `requests` package provides a simple and intuitive interface for making HTTP requests.
- Supports various HTTP methods: It supports all standard HTTP methods, including `GET`, `POST`, `PUT`, `DELETE`, `HEAD` and `OPTIONS`.
- Handles responses: Easily access and handle response data, including status codes, headers and content of the response.
- Handles authentication: Supports basic and digest authentication out of the box, and OAuth through the companion `requests-oauthlib` package.
- Handles cookies: Automatically handles cookies, simplifying interactions with websites that require session management.
- Session management: Provides a `Session` object for managing persistent connections and cookies across multiple requests.
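
As a small illustration of the `Session` object, the sketch below sets a default header once and then builds a request without sending it, so it runs offline. The URL `https://api.example.com/items` and the `page` parameter are placeholders, not a real service.

```python
import requests

# A Session carries default headers and cookies across requests and reuses connections.
with requests.Session() as session:
    session.headers.update({"Accept": "application/json"})

    # prepare_request() merges the session defaults into the request without sending it.
    request = requests.Request(
        "GET", "https://api.example.com/items", params={"page": 2}
    )
    prepared = session.prepare_request(request)

print(prepared.method, prepared.url)
print(prepared.headers["Accept"])
```

Sending the prepared request would be done with `session.send(prepared)`; calling `session.get(...)` directly performs both steps at once.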

### Installation
`pip install requests`

### Making a `GET` request

In [1]:
import requests

url = "https://api.ipify.org"
response = requests.get(url) 
response

<Response [200]>

### Key methods and attributes
- `requests.get()`: Used for making `GET` requests to retrieve data from a server.
- `requests.post()`: Used for sending data to a server (e.g., submitting forms).
- `requests.put()`: Used for updating existing resources on a server.
- `requests.delete()`: Used for deleting resources from a server.
- `response.json()`: Parses the JSON body of the response into Python objects (e.g., a dictionary or a list).
- `response.status_code`: The HTTP status code of the server's response (e.g., 200 for success, 404 for not found).
- `response.text`: The content of the response as a string.
- `response.headers`: A case-insensitive dictionary containing the HTTP headers of the response.

In [2]:
response.status_code

200

In [3]:
response.text

'43.224.129.75'

In [4]:
response.headers

{'Date': 'Sun, 19 Jan 2025 18:19:23 GMT', 'Content-Type': 'text/plain', 'Content-Length': '13', 'Connection': 'keep-alive', 'Vary': 'Origin', 'cf-cache-status': 'DYNAMIC', 'Server': 'cloudflare', 'CF-RAY': '9048d52fa83de6d8-BLR', 'server-timing': 'cfL4;desc="?proto=TCP&rtt=6690&min_rtt=6198&rtt_var=2048&sent=5&recv=8&lost=0&retrans=0&sent_bytes=2834&recv_bytes=763&delivery_rate=545808&cwnd=252&unsent_bytes=0&cid=479c31b52590afcc&ts=272&x=0"'}

In [5]:
if response.status_code == 200:
    print(response.text)
else:
    print(f"Request failed with status code: {response.status_code}")

43.224.129.75
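
Checking `status_code` by hand works, but it does not cover network failures or slow servers. A common alternative is to combine a `timeout` with `raise_for_status()`. The sketch below is one way to do that: `fetch_text` is a hypothetical helper, and the 404 response is built by hand so the error path can be shown without a network call.

```python
import requests


def fetch_text(url, timeout=10):
    """Return the response body, or None if the request fails."""
    try:
        # Without a timeout, requests can wait indefinitely for a slow server.
        response = requests.get(url, timeout=timeout)
        # Raises requests.HTTPError for 4xx and 5xx status codes.
        response.raise_for_status()
    except requests.RequestException as error:
        print(f"Request failed: {error}")
        return None
    return response.text


# Offline illustration of raise_for_status() on a hand-built response object.
fake = requests.Response()
fake.status_code = 404
try:
    fake.raise_for_status()
    outcome = "ok"
except requests.HTTPError as error:
    outcome = f"error: {error}"
print(outcome)
```

In the notebook, `fetch_text("https://api.ipify.org")` would replace the explicit `if`/`else` check.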


# GitHub API
Documentation: https://docs.github.com/en/rest?apiVersion=2022-11-28

In [6]:
# the base URL from which all further endpoints are accessed
base_url = "https://api.github.com/"

### Extracting profile information

In [7]:
# accessing my github profile
my_user_name = "vidishsirdesai"
url = base_url + "users/" + my_user_name
url

'https://api.github.com/users/vidishsirdesai'

In [8]:
response = requests.get(url)
response.text

'{"login":"vidishsirdesai","id":76195985,"node_id":"MDQ6VXNlcjc2MTk1OTg1","avatar_url":"https://avatars.githubusercontent.com/u/76195985?v=4","gravatar_id":"","url":"https://api.github.com/users/vidishsirdesai","html_url":"https://github.com/vidishsirdesai","followers_url":"https://api.github.com/users/vidishsirdesai/followers","following_url":"https://api.github.com/users/vidishsirdesai/following{/other_user}","gists_url":"https://api.github.com/users/vidishsirdesai/gists{/gist_id}","starred_url":"https://api.github.com/users/vidishsirdesai/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/vidishsirdesai/subscriptions","organizations_url":"https://api.github.com/users/vidishsirdesai/orgs","repos_url":"https://api.github.com/users/vidishsirdesai/repos","events_url":"https://api.github.com/users/vidishsirdesai/events{/privacy}","received_events_url":"https://api.github.com/users/vidishsirdesai/received_events","type":"User","user_view_type":"public","site_admin":

In [9]:
# the response body is a JSON string made up of key-value pairs
# response.json() parses it into a Python dictionary
# the notebook then displays the dictionary in a more readable format

output = response.json()
output

{'login': 'vidishsirdesai',
 'id': 76195985,
 'node_id': 'MDQ6VXNlcjc2MTk1OTg1',
 'avatar_url': 'https://avatars.githubusercontent.com/u/76195985?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/vidishsirdesai',
 'html_url': 'https://github.com/vidishsirdesai',
 'followers_url': 'https://api.github.com/users/vidishsirdesai/followers',
 'following_url': 'https://api.github.com/users/vidishsirdesai/following{/other_user}',
 'gists_url': 'https://api.github.com/users/vidishsirdesai/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/vidishsirdesai/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/vidishsirdesai/subscriptions',
 'organizations_url': 'https://api.github.com/users/vidishsirdesai/orgs',
 'repos_url': 'https://api.github.com/users/vidishsirdesai/repos',
 'events_url': 'https://api.github.com/users/vidishsirdesai/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/vidishsirdesai/received_events',
 'type'

In [10]:
# alternatively, the json module can parse the response text
import json

output = json.loads(response.text)
output

{'login': 'vidishsirdesai',
 'id': 76195985,
 'node_id': 'MDQ6VXNlcjc2MTk1OTg1',
 'avatar_url': 'https://avatars.githubusercontent.com/u/76195985?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/vidishsirdesai',
 'html_url': 'https://github.com/vidishsirdesai',
 'followers_url': 'https://api.github.com/users/vidishsirdesai/followers',
 'following_url': 'https://api.github.com/users/vidishsirdesai/following{/other_user}',
 'gists_url': 'https://api.github.com/users/vidishsirdesai/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/vidishsirdesai/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/vidishsirdesai/subscriptions',
 'organizations_url': 'https://api.github.com/users/vidishsirdesai/orgs',
 'repos_url': 'https://api.github.com/users/vidishsirdesai/repos',
 'events_url': 'https://api.github.com/users/vidishsirdesai/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/vidishsirdesai/received_events',
 'type'

This data is now available as a Python dictionary and can be used as needed.

### Save the profile picture

In [11]:
# to collect the user's profile picture data for a CV model
output["avatar_url"]

'https://avatars.githubusercontent.com/u/76195985?v=4'

In [12]:
# saving the profile picture

# send a GET request for the image
response_image = requests.get(output["avatar_url"])

# write the raw bytes to a file; the with block closes the file automatically
with open(f"{my_user_name}.png", "wb") as file:
    file.write(response_image.content)
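
The cell above always uses a `.png` extension, although the server may return a different image format. One possible refinement is to derive the extension from the response's `Content-Type` header. The helper `filename_for` and the `.bin` fallback below are assumptions for illustration only.

```python
import mimetypes


def filename_for(user_name, content_type, default=".bin"):
    """Build a file name whose extension matches the response's Content-Type."""
    # Drop parameters such as "; charset=utf-8" before the lookup.
    media_type = content_type.split(";")[0].strip().lower()
    extension = mimetypes.guess_extension(media_type) or default
    return f"{user_name}{extension}"


print(filename_for("some_user", "image/png"))
```

In the notebook this could be called as `filename_for(my_user_name, response_image.headers.get("Content-Type", ""))`.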

### Extracting details about the logged in user

In [13]:
url = base_url + "user"
response = requests.get(url)
response

<Response [401]>

In [14]:
# investigate the error
json.loads(response.content)

{'message': 'Requires authentication',
 'documentation_url': 'https://docs.github.com/rest/users/users#get-the-authenticated-user',
 'status': '401'}

In [15]:
# authenticating
# requests made from Python authenticate with a personal access token instead of a password
# go to GitHub Profile > Settings > Developer settings > Personal access tokens > Tokens (classic) > Generate new token > Generate new token (classic)
# enter the authentication code and, on the next page, set the scopes and generate the token
url

'https://api.github.com/user'

In [16]:
token = "<YOUR_PERSONAL_ACCESS_TOKEN>"  # placeholder; never publish a real token

In [17]:
headers = {
    "authorization": "Bearer {}".format(token)
}
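
Rather than writing a token into the notebook, it can be read from the environment. The sketch below assumes the token has been exported as `GITHUB_TOKEN` (an assumed variable name) and wraps header construction in a hypothetical helper, `github_headers`. The `Accept` and `X-GitHub-Api-Version` values are the ones referenced by the GitHub documentation linked above.

```python
import os


def github_headers(token=None):
    """Build request headers for the GitHub REST API."""
    # GITHUB_TOKEN is an assumed variable name; it must be set before running.
    token = token or os.environ.get("GITHUB_TOKEN")
    headers = {
        "Accept": "application/vnd.github+json",
        "X-GitHub-Api-Version": "2022-11-28",
    }
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers


# An explicit dummy token keeps the example independent of the environment.
example_headers = github_headers(token="example-token")
print(sorted(example_headers))
```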

In [18]:
response = requests.get(url, headers = headers)
response

<Response [200]>

In [19]:
response.content

b'{"login":"vidishsirdesai","id":76195985,"node_id":"MDQ6VXNlcjc2MTk1OTg1","avatar_url":"https://avatars.githubusercontent.com/u/76195985?v=4","gravatar_id":"","url":"https://api.github.com/users/vidishsirdesai","html_url":"https://github.com/vidishsirdesai","followers_url":"https://api.github.com/users/vidishsirdesai/followers","following_url":"https://api.github.com/users/vidishsirdesai/following{/other_user}","gists_url":"https://api.github.com/users/vidishsirdesai/gists{/gist_id}","starred_url":"https://api.github.com/users/vidishsirdesai/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/vidishsirdesai/subscriptions","organizations_url":"https://api.github.com/users/vidishsirdesai/orgs","repos_url":"https://api.github.com/users/vidishsirdesai/repos","events_url":"https://api.github.com/users/vidishsirdesai/events{/privacy}","received_events_url":"https://api.github.com/users/vidishsirdesai/received_events","type":"User","user_view_type":"private","site_admin

In [20]:
response.text

'{"login":"vidishsirdesai","id":76195985,"node_id":"MDQ6VXNlcjc2MTk1OTg1","avatar_url":"https://avatars.githubusercontent.com/u/76195985?v=4","gravatar_id":"","url":"https://api.github.com/users/vidishsirdesai","html_url":"https://github.com/vidishsirdesai","followers_url":"https://api.github.com/users/vidishsirdesai/followers","following_url":"https://api.github.com/users/vidishsirdesai/following{/other_user}","gists_url":"https://api.github.com/users/vidishsirdesai/gists{/gist_id}","starred_url":"https://api.github.com/users/vidishsirdesai/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/vidishsirdesai/subscriptions","organizations_url":"https://api.github.com/users/vidishsirdesai/orgs","repos_url":"https://api.github.com/users/vidishsirdesai/repos","events_url":"https://api.github.com/users/vidishsirdesai/events{/privacy}","received_events_url":"https://api.github.com/users/vidishsirdesai/received_events","type":"User","user_view_type":"private","site_admin"

In [21]:
response.json()

{'login': 'vidishsirdesai',
 'id': 76195985,
 'node_id': 'MDQ6VXNlcjc2MTk1OTg1',
 'avatar_url': 'https://avatars.githubusercontent.com/u/76195985?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/vidishsirdesai',
 'html_url': 'https://github.com/vidishsirdesai',
 'followers_url': 'https://api.github.com/users/vidishsirdesai/followers',
 'following_url': 'https://api.github.com/users/vidishsirdesai/following{/other_user}',
 'gists_url': 'https://api.github.com/users/vidishsirdesai/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/vidishsirdesai/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/vidishsirdesai/subscriptions',
 'organizations_url': 'https://api.github.com/users/vidishsirdesai/orgs',
 'repos_url': 'https://api.github.com/users/vidishsirdesai/repos',
 'events_url': 'https://api.github.com/users/vidishsirdesai/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/vidishsirdesai/received_events',
 'type'

In [22]:
json.loads(response.content)

{'login': 'vidishsirdesai',
 'id': 76195985,
 'node_id': 'MDQ6VXNlcjc2MTk1OTg1',
 'avatar_url': 'https://avatars.githubusercontent.com/u/76195985?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/vidishsirdesai',
 'html_url': 'https://github.com/vidishsirdesai',
 'followers_url': 'https://api.github.com/users/vidishsirdesai/followers',
 'following_url': 'https://api.github.com/users/vidishsirdesai/following{/other_user}',
 'gists_url': 'https://api.github.com/users/vidishsirdesai/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/vidishsirdesai/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/vidishsirdesai/subscriptions',
 'organizations_url': 'https://api.github.com/users/vidishsirdesai/orgs',
 'repos_url': 'https://api.github.com/users/vidishsirdesai/repos',
 'events_url': 'https://api.github.com/users/vidishsirdesai/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/vidishsirdesai/received_events',
 'type'

### Creating repositories using GitHub API

In [23]:
base_url

'https://api.github.com/'

In [24]:
url = base_url + "user/repos"
url

'https://api.github.com/user/repos'

In [25]:
data = {
    "name": "api_demo",
    "description": "A demo repository created to learn how repos are created using GitHub API.",
    "private": False
}

In [26]:
response = requests.post(url, headers = headers, json = data)
response

<Response [201]>

### Deleting the created repository

In [27]:
base_url

'https://api.github.com/'

In [28]:
user = "vidishsirdesai"
repo = "api_demo"

In [29]:
url = base_url + f"repos/{user}/{repo}"
url

'https://api.github.com/repos/vidishsirdesai/api_demo'

In [30]:
response = requests.delete(url, headers = headers)
response

<Response [204]>
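
The status codes seen so far (200, 401, 201 and 204) can be looked up with the standard library's `http.HTTPStatus`. The helper `is_success` is a hypothetical convenience, not part of `requests`.

```python
from http import HTTPStatus

# Status codes returned by the calls so far, in the order they appeared.
for code in (200, 401, 201, 204):
    status = HTTPStatus(code)
    print(code, status.phrase, "-", status.description)


def is_success(code):
    """Return True for any 2xx status code."""
    return 200 <= code < 300
```

A 204 response has no body, so `response.json()` should not be called on the result of the `DELETE` request.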

### List all users

In [31]:
base_url

'https://api.github.com/'

In [32]:
url = base_url + "users"
url

'https://api.github.com/users'

In [33]:
# optional request headers recommended in the GitHub documentation
# headers.update({
#     "Accept": "application/vnd.github+json",
#     "X-GitHub-Api-Version": "2022-11-28"
# })

In [34]:
response = requests.get(url, headers = headers)
response

<Response [200]>

In [35]:
response.text

'[{"login":"mojombo","id":1,"node_id":"MDQ6VXNlcjE=","avatar_url":"https://avatars.githubusercontent.com/u/1?v=4","gravatar_id":"","url":"https://api.github.com/users/mojombo","html_url":"https://github.com/mojombo","followers_url":"https://api.github.com/users/mojombo/followers","following_url":"https://api.github.com/users/mojombo/following{/other_user}","gists_url":"https://api.github.com/users/mojombo/gists{/gist_id}","starred_url":"https://api.github.com/users/mojombo/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/mojombo/subscriptions","organizations_url":"https://api.github.com/users/mojombo/orgs","repos_url":"https://api.github.com/users/mojombo/repos","events_url":"https://api.github.com/users/mojombo/events{/privacy}","received_events_url":"https://api.github.com/users/mojombo/received_events","type":"User","user_view_type":"public","site_admin":false},{"login":"defunkt","id":2,"node_id":"MDQ6VXNlcjI=","avatar_url":"https://avatars.githubusercontent

In [36]:
data = response.json()
data

[{'login': 'mojombo',
  'id': 1,
  'node_id': 'MDQ6VXNlcjE=',
  'avatar_url': 'https://avatars.githubusercontent.com/u/1?v=4',
  'gravatar_id': '',
  'url': 'https://api.github.com/users/mojombo',
  'html_url': 'https://github.com/mojombo',
  'followers_url': 'https://api.github.com/users/mojombo/followers',
  'following_url': 'https://api.github.com/users/mojombo/following{/other_user}',
  'gists_url': 'https://api.github.com/users/mojombo/gists{/gist_id}',
  'starred_url': 'https://api.github.com/users/mojombo/starred{/owner}{/repo}',
  'subscriptions_url': 'https://api.github.com/users/mojombo/subscriptions',
  'organizations_url': 'https://api.github.com/users/mojombo/orgs',
  'repos_url': 'https://api.github.com/users/mojombo/repos',
  'events_url': 'https://api.github.com/users/mojombo/events{/privacy}',
  'received_events_url': 'https://api.github.com/users/mojombo/received_events',
  'type': 'User',
  'user_view_type': 'public',
  'site_admin': False},
 {'login': 'defunkt',
  '

In [37]:
len(data)

30
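
The 30 users returned here are a single page of results, not the full list. To my understanding of the GitHub documentation, this endpoint is paged with the `since` (last user ID seen) and `per_page` query parameters; treat those names as an assumption to verify there. The sketch below shows the cursor logic with a fake page fetcher so that it runs offline. `collect_users` and `fake_fetch_page` are hypothetical names.

```python
def collect_users(fetch_page, limit, per_page=100):
    """Collect up to `limit` users by following the `since` cursor."""
    users, since = [], 0
    while len(users) < limit:
        page = fetch_page(since=since, per_page=per_page)
        if not page:
            break  # no more users
        users.extend(page)
        since = page[-1]["id"]  # the next page starts after the last ID seen
    return users[:limit]


# Stand-in for the real API: 250 fake users with IDs 1..250.
def fake_fetch_page(since, per_page):
    last = min(since + per_page, 250)
    return [{"id": i, "login": f"user{i}"} for i in range(since + 1, last + 1)]


users = collect_users(fake_fetch_page, limit=120)
print(len(users), users[0]["login"], users[-1]["login"])
```

Against the real API, `fetch_page` could wrap `requests.get(url, headers=headers, params={"since": since, "per_page": per_page}).json()`.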

### Creating a DataFrame of the collected data

In [38]:
data[0]["login"], data[0]["avatar_url"], data[0]["url"], data[0]["repos_url"]

('mojombo',
 'https://avatars.githubusercontent.com/u/1?v=4',
 'https://api.github.com/users/mojombo',
 'https://api.github.com/users/mojombo/repos')

In [39]:
data[1]["login"], data[1]["avatar_url"], data[1]["url"], data[1]["repos_url"]

('defunkt',
 'https://avatars.githubusercontent.com/u/2?v=4',
 'https://api.github.com/users/defunkt',
 'https://api.github.com/users/defunkt/repos')

In [40]:
data_dict = {
    "user_name": [],
    "avatar_url": [],
    "url": [],
    "repos_url": []
}

for user in data:
    data_dict["user_name"].append(user["login"])
    data_dict["avatar_url"].append(user["avatar_url"])
    data_dict["url"].append(user["url"])
    data_dict["repos_url"].append(user["repos_url"])

data_dict

{'user_name': ['mojombo',
  'defunkt',
  'pjhyett',
  'wycats',
  'ezmobius',
  'ivey',
  'evanphx',
  'vanpelt',
  'wayneeseguin',
  'brynary',
  'kevinclark',
  'technoweenie',
  'macournoyer',
  'takeo',
  'caged',
  'topfunky',
  'anotherjesse',
  'roland',
  'lukas',
  'fanvsfan',
  'tomtt',
  'railsjitsu',
  'nitay',
  'kevwil',
  'KirinDave',
  'jamesgolick',
  'atmos',
  'errfree',
  'mojodna',
  'bmizerany'],
 'avatar_url': ['https://avatars.githubusercontent.com/u/1?v=4',
  'https://avatars.githubusercontent.com/u/2?v=4',
  'https://avatars.githubusercontent.com/u/3?v=4',
  'https://avatars.githubusercontent.com/u/4?v=4',
  'https://avatars.githubusercontent.com/u/5?v=4',
  'https://avatars.githubusercontent.com/u/6?v=4',
  'https://avatars.githubusercontent.com/u/7?v=4',
  'https://avatars.githubusercontent.com/u/17?v=4',
  'https://avatars.githubusercontent.com/u/18?v=4',
  'https://avatars.githubusercontent.com/u/19?v=4',
  'https://avatars.githubusercontent.com/u/20?v=4',

In [41]:
import pandas as pd

df = pd.DataFrame(data_dict)
df.head()

| | user_name | avatar_url | url | repos_url |
|---|---|---|---|---|
| 0 | mojombo | https://avatars.githubusercontent.com/u/1?v=4 | https://api.github.com/users/mojombo | https://api.github.com/users/mojombo/repos |
| 1 | defunkt | https://avatars.githubusercontent.com/u/2?v=4 | https://api.github.com/users/defunkt | https://api.github.com/users/defunkt/repos |
| 2 | pjhyett | https://avatars.githubusercontent.com/u/3?v=4 | https://api.github.com/users/pjhyett | https://api.github.com/users/pjhyett/repos |
| 3 | wycats | https://avatars.githubusercontent.com/u/4?v=4 | https://api.github.com/users/wycats | https://api.github.com/users/wycats/repos |
| 4 | ezmobius | https://avatars.githubusercontent.com/u/5?v=4 | https://api.github.com/users/ezmobius | https://api.github.com/users/ezmobius/repos |
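
The intermediate dictionary of lists is not strictly required, because pandas can build a DataFrame directly from a list of dictionaries. The sketch below uses two records shaped like the API response, trimmed to a few keys. The name `df_short` is arbitrary.

```python
import pandas as pd

# Two records shaped like the API response, trimmed to a few keys for brevity.
records = [
    {"login": "mojombo", "id": 1, "url": "https://api.github.com/users/mojombo", "site_admin": False},
    {"login": "defunkt", "id": 2, "url": "https://api.github.com/users/defunkt", "site_admin": False},
]

# `columns` selects keys from the list of dictionaries; rename() adjusts the header.
df_short = pd.DataFrame(records, columns=["login", "url"]).rename(
    columns={"login": "user_name"}
)
print(df_short)
```

With the full response, the same pattern would be `pd.DataFrame(data, columns=["login", "avatar_url", "url", "repos_url"])`.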


In [42]:
df.to_csv("github_users.csv", index = False)