<a href="https://colab.research.google.com/github/saad-ameer/Python-for-Data-Analyst/blob/main/data_sources.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pandas Notes – Importing and Exporting Excel/CSV Files

---

## Importing Files

Pandas provides two main functions for reading tabular files:

- `pd.read_csv('file.csv')`
- `pd.read_excel('file.xlsx', sheet_name='Sheet1')`

### Key Points

- If the file is in the **current working directory**, the filename alone is enough.
- If the file is in a **different directory**, provide a relative or full file path.
- `read_excel()` reads the first sheet by default; pass `sheet_name` to select a different one.

### Example

```python
import pandas as pd

# Read CSV
countries_csv = pd.read_csv('top10_countries.csv')

# Read Excel
countries_excel = pd.read_excel('top10_countries.xlsx', sheet_name='Sheet1')
```
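
When the file lives outside the current working directory, building the path with `pathlib` keeps the call portable across operating systems. The sketch below is illustrative only: the temporary folder, the file name, and the three-column sample are assumptions made up so the code can run end to end.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical data folder and file, created here only for the demonstration.
data_dir = Path(tempfile.mkdtemp())
csv_path = data_dir / "sample_countries.csv"
csv_path.write_text(
    "Rank,Country,Population\n"
    "1,China,1412600000\n"
    "2,India,1386946912\n",
    encoding="utf-8",
)

# A full path works no matter what the current working directory is.
countries = pd.read_csv(csv_path)

# usecols limits the import to the columns that are actually needed.
names_only = pd.read_csv(csv_path, usecols=["Country", "Population"])

print(countries.shape)
print(list(names_only.columns))
```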

---

## Differences in Format

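- The two files (CSV and Excel) may contain the **same data** but load with different **types**:
  - A CSV file stores plain text, so values such as `17.80%` or `31-Dec-21` are read as strings.
  - An Excel file stores typed cells, so the same values arrive as floats (`0.178`) and dates (`2021-12-31`).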

You can inspect both using:

```python
print(countries_csv.head())
print(countries_excel.head())
```
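
Once the two DataFrames are inspected, the CSV columns can be converted so that both line up. This is a rough sketch that uses two rows copied from the CSV output in this notebook; the in-memory `csv_text` string is a stand-in for the real file.

```python
import io

import pandas as pd

# Two sample rows in the same layout as the CSV file (in-memory stand-in).
csv_text = (
    "Rank,Country / Dependency,% of world,Date\n"
    "1,China,17.80%,31-Dec-21\n"
    "2,India,17.50%,18-Jan-22\n"
)
countries_csv = pd.read_csv(io.StringIO(csv_text))

# Both columns arrive as text because CSV carries no type information.
print(countries_csv.dtypes)

# Strip the percent sign and rescale to a fraction, as Excel stores it.
countries_csv["% of world"] = (
    countries_csv["% of world"].str.rstrip("%").astype(float) / 100
)

# Parse day-month-year strings such as 31-Dec-21 into real timestamps.
countries_csv["Date"] = pd.to_datetime(countries_csv["Date"], format="%d-%b-%y")

print(countries_csv.dtypes)
```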

---

## Exporting Files

You can export any DataFrame to CSV or Excel using:

- `df.to_csv('filename.csv', index=False)`
- `df.to_excel('filename.xlsx', index=False)`

**Note**: `index=False` prevents the index column from being saved into the file.

### Example

```python
# Aggregate population by region; column names are case-sensitive and must match the data
countries_out = countries_excel.groupby('Region')['Population'].sum().reset_index()

# Export to CSV
countries_out.to_csv('/Users/yourname/Desktop/countries_out.csv', index=False)

# Export to Excel
countries_out.to_excel('/Users/yourname/Desktop/countries_out.xlsx', index=False)
```

- If only the file name is given, the file is saved in the **current working directory**.
- If a **full path** is given, the file is saved in the specified directory.
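
The effect of `index=False` is easiest to see by writing the same DataFrame both ways and reading the files back. A minimal sketch, assuming a made-up two-row DataFrame and a temporary output folder:

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame(
    {"Region": ["Africa", "Europe"], "Population": [211401000, 146171015]}
)

# Hypothetical output folder, used only so the sketch leaves no files behind.
out_dir = Path(tempfile.mkdtemp())
df.to_csv(out_dir / "with_index.csv")             # default: index is written
df.to_csv(out_dir / "no_index.csv", index=False)  # index is skipped

with_index = pd.read_csv(out_dir / "with_index.csv")
no_index = pd.read_csv(out_dir / "no_index.csv")

# The saved index comes back as an extra, unnamed column.
print(list(with_index.columns))
print(list(no_index.columns))
```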

---

## Summary

| Operation          | Function           | Notes                             |
|--------------------|--------------------|-----------------------------------|
| Import CSV         | `read_csv()`       | Accepts `sep`, `header`, etc.     |
| Import Excel       | `read_excel()`     | Defaults to the first sheet       |
| Export to CSV      | `to_csv()`         | Use `index=False` to skip index   |
| Export to Excel    | `to_excel()`       | Same as above                     |

---

## Useful Docs

- [Pandas `read_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
- [Pandas `read_excel`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)
- [Pandas `to_csv`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html)
- [Pandas `to_excel`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_excel.html)

In [1]:
import pandas as pd

In [2]:
countries_csv = pd.read_csv('top_10_countries.csv')

In [3]:
countries_csv

Unnamed: 0,Rank,Country / Dependency,Region,Population,% of world,Date
0,1,China,Asia,1412600000,17.80%,31-Dec-21
1,2,India,Asia,1386946912,17.50%,18-Jan-22
2,3,United States,Americas,333073186,4.20%,18-Jan-22
3,4,Indonesia,Asia,271350000,3.42%,31-Dec-20
4,5,Pakistan,Asia,225200000,2.84%,01-Jul-21
5,6,Brazil,Americas,214231641,2.70%,18-Jan-22
6,7,Nigeria,Africa,211401000,2.67%,01-Jul-21
7,8,Bangladesh,Asia,172062576,2.17%,18-Jan-22
8,9,Russia,Europe,146171015,1.84%,01-Jan-21
9,10,Mexico,Americas,126014024,1.59%,02-Mar-20


In [4]:
countries_excel = pd.read_excel('top_10_countries.xls', sheet_name='data')

In [5]:
countries_excel

Unnamed: 0,Rank,Country / Dependency,Region,Population,% of world,Date
0,1,China,Asia,1412600000,0.178,2021-12-31
1,2,India,Asia,1386946912,0.175,2022-01-18
2,3,United States,Americas,333073186,0.042,2022-01-18
3,4,Indonesia,Asia,271350000,0.0342,2020-12-31
4,5,Pakistan,Asia,225200000,0.0284,2021-07-01
5,6,Brazil,Americas,214231641,0.027,2022-01-18
6,7,Nigeria,Africa,211401000,0.0267,2021-07-01
7,8,Bangladesh,Asia,172062576,0.0217,2022-01-18
8,9,Russia,Europe,146171015,0.0184,2021-01-01
9,10,Mexico,Americas,126014024,0.0159,2020-03-02


In [7]:
#countries_json = pd.read_json('data.json')

In [8]:
countries_out = countries_excel.pivot_table(index='Region',values='Population',aggfunc='sum')

In [9]:
countries_out

Region,Population
Africa,211401000
Americas,673318851
Asia,3468159488
Europe,146171015


In [20]:
countries_out.to_csv('countries_out.csv')

In [21]:
countries_out.to_excel('countries_out.xlsx')
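
The cells above export a pivot table whose row labels (`Region`) live in the index, so the default `index=True` is what keeps them in the file. A minimal sketch with a few rows copied from the table above; calling `to_csv()` without a path returns the text instead of writing a file.

```python
import pandas as pd

countries = pd.DataFrame(
    {
        "Region": ["Asia", "Asia", "Americas", "Africa"],
        "Population": [1412600000, 1386946912, 333073186, 211401000],
    }
)
by_region = countries.pivot_table(index="Region", values="Population", aggfunc="sum")

# Region is the index here, so it is only exported when the index is kept.
with_index = by_region.to_csv()
without_index = by_region.to_csv(index=False)

print(with_index)
print(without_index)
```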

# Pandas Notes – Reading HTML Tables

---

## Reading Tables from Websites with `read_html`

Pandas provides a function, `pd.read_html()`, that reads the HTML tables on a webpage into a **list of DataFrame objects**.

### Key Concepts

- `read_html()` scans the given URL for any `<table>` elements.
- Each table is converted into a DataFrame and returned as a list.
- You can use **indexing** to select the correct table or use the `match` argument to narrow down results.

---

## Example: Top Grossing Movies on Wikipedia

### Step 1: Import Libraries and Define URL

```python
import pandas as pd

url = "https://en.wikipedia.org/wiki/List_of_highest-grossing_films"
```

### Step 2: Read All Tables

```python
movies = pd.read_html(url)
type(movies)  # Output: list
```

### Step 3: Inspect Tables

```python
# Loop to identify the desired table
for i, table in enumerate(movies):
    print(f"Table {i}")
    print(table.head())
```

Or just check manually:

```python
# Check specific table by index
movies[2].head()  # This might be the "Highest-grossing films by year of release"
```

---

## Using the `match` Argument

Instead of scanning each table manually, you can use the `match` parameter to filter tables that contain specific text:

```python
# Use match to filter tables that contain a certain string
movies_filtered = pd.read_html(url, match="Highest-grossing films by year of release")

# The result is still a list; select the first item
df = movies_filtered[0]
df.head()
```
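
To see how `match` narrows the result without depending on a live page, the same call can be run against a small HTML snippet. The snippet below is made up for illustration, and it is wrapped in `io.StringIO` because recent pandas versions expect a file-like object rather than a raw HTML string. The `try`/`except` is an assumption about the environment: `read_html()` raises `ImportError` when no HTML parser is installed.

```python
import io

import pandas as pd

# Made-up page containing two tables; only the first mentions "Golden Ball".
html = """
<table>
  <tr><th>World Cup</th><th>Golden Ball</th></tr>
  <tr><td>1982 Spain</td><td>Paolo Rossi</td></tr>
  <tr><td>1986 Mexico</td><td>Diego Maradona</td></tr>
</table>
<table>
  <tr><th>Nation</th><th>Gold</th></tr>
  <tr><td>Argentina</td><td>3</td></tr>
</table>
"""

try:
    all_tables = pd.read_html(io.StringIO(html))
    golden = pd.read_html(io.StringIO(html), match="Golden Ball")
except ImportError:
    # No parser (lxml, or bs4 + html5lib) is available in this environment.
    all_tables, golden = [], []

print(len(all_tables), len(golden))
```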

---

## Summary

| Feature         | Usage                                       |
|-----------------|---------------------------------------------|
| Basic Read      | `pd.read_html(url)`                         |
| Filter with Text| `pd.read_html(url, match="text in table")`  |
| Output Format   | List of DataFrames                          |
| Table Selection | Index into list or use loop to explore      |

---

## Useful Tips

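- If the website blocks automated requests, `read_html()` may fail with an HTTP error (for example, 403 Forbidden).
- `read_html()` needs an HTML parser. By default it tries **`lxml`** first and falls back on **`bs4`** together with **`html5lib`**. Make sure these libraries are installed.
- For more control over scraping (e.g., custom request headers or data outside `<table>` elements), consider the libraries below. Neither executes JavaScript, so dynamically rendered pages need a browser-automation tool instead: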
  - [`requests`](https://docs.python-requests.org/)
  - [`BeautifulSoup`](https://www.crummy.com/software/BeautifulSoup/)
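
The parser requirement above can be checked before calling `read_html()`. A small sketch using only the standard library; note that these are import names, which can differ from package names (`bs4` is installed as `beautifulsoup4`).

```python
import importlib.util

# Import names of the parsers read_html() can use.
parser_modules = ("lxml", "bs4", "html5lib")

# find_spec() returns None when a top-level module cannot be found.
available = {
    name: importlib.util.find_spec(name) is not None
    for name in parser_modules
}

for name, installed in available.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```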

---

## Documentation

- [Pandas `read_html`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_html.html)

In [22]:
import pandas as pd

In [23]:
url = 'https://en.wikipedia.org/wiki/FIFA_World_Cup_awards'

In [24]:
winners = pd.read_html(url)

In [25]:
type(winners)

list

In [28]:
winners[0]

Unnamed: 0,World Cup,Golden Ball,Silver Ball,Bronze Ball
0,1982 Spain,Paolo Rossi,Falcão,Karl-Heinz Rummenigge
1,1986 Mexico,Diego Maradona,Harald Schumacher,Preben Elkjær
2,1990 Italy,Salvatore Schillaci,Lothar Matthäus,Diego Maradona
3,1994 United States,Romário,Roberto Baggio,Hristo Stoichkov
4,1998 France,Ronaldo,Davor Šuker,Lilian Thuram
5,2002 South Korea/Japan,Oliver Kahn,Ronaldo,Hong Myung-bo
6,2006 Germany,Zinedine Zidane,Fabio Cannavaro,Andrea Pirlo
7,2010 South Africa,Diego Forlán,Wesley Sneijder,David Villa
8,2014 Brazil,Lionel Messi,Thomas Müller,Arjen Robben
9,2018 Russia,Luka Modrić,Eden Hazard,Antoine Griezmann


In [29]:
winners[1]

Unnamed: 0,Nation,Gold,Silver,Bronze,Total
0,Argentina,3,0,1,4
1,Italy,2,2,1,5
2,Brazil,2,2,0,4
3,West Germany/Germany,1,3,1,5
4,France,1,1,2,4
5,Croatia,1,1,1,3
6,Uruguay,1,0,0,1
7,Netherlands,0,1,1,2
8,Belgium,0,1,0,1
9,Bulgaria,0,0,1,1


In [35]:
winners[7]

Unnamed: 0,World Cup,FIFA Fair Play Trophy winners
0,1970 Mexico,Peru
1,1974 West Germany,West Germany
2,1978 Argentina,Argentina
3,1982 Spain,Brazil
4,1986 Mexico,Brazil
5,1990 Italy,England
6,1994 United States,Brazil
7,1998 France,England France
8,2002 South Korea/Japan,Belgium
9,2006 Germany,Brazil Spain


In [37]:
winners = pd.read_html(url, match='Golden Ball')

In [40]:
winners[0]

Unnamed: 0,World Cup,Golden Ball,Silver Ball,Bronze Ball
0,1982 Spain,Paolo Rossi,Falcão,Karl-Heinz Rummenigge
1,1986 Mexico,Diego Maradona,Harald Schumacher,Preben Elkjær
2,1990 Italy,Salvatore Schillaci,Lothar Matthäus,Diego Maradona
3,1994 United States,Romário,Roberto Baggio,Hristo Stoichkov
4,1998 France,Ronaldo,Davor Šuker,Lilian Thuram
5,2002 South Korea/Japan,Oliver Kahn,Ronaldo,Hong Myung-bo
6,2006 Germany,Zinedine Zidane,Fabio Cannavaro,Andrea Pirlo
7,2010 South Africa,Diego Forlán,Wesley Sneijder,David Villa
8,2014 Brazil,Lionel Messi,Thomas Müller,Arjen Robben
9,2018 Russia,Luka Modrić,Eden Hazard,Antoine Griezmann


# Connecting to a MySQL Database Using Python and Pandas

---

## Overview

This demonstration illustrates how to **connect Python to a MySQL database** and **read data into a Pandas DataFrame**. Python supports connecting to various data sources, including relational and NoSQL databases.

---

## Questions to Consider Before Connecting

To determine how to connect to your database, ask the following:

- What **type of database** is it?  
  - Relational (e.g., MySQL, PostgreSQL, Oracle)
  - NoSQL (e.g., MongoDB)
- What **DBMS (Database Management System)** are you using?
- Will you need support from a **Database Administrator (DBA)** for access credentials or VPN/firewall settings?

---

## Required Libraries

For this demonstration:

- **`pymysql`**: Python client for connecting to MySQL databases.
- **`pandas.read_sql()`**: Executes SQL queries and loads the result into a DataFrame.

Other databases require different libraries:
- Oracle: `oracledb` (the successor to `cx_Oracle`)
- PostgreSQL: `psycopg2`
- Microsoft SQL Server: `pyodbc` or `sqlalchemy` with appropriate driver

---

## Example: Connect to a Local MySQL Database

### Step 1: Import Libraries

```python
import pandas as pd
import pymysql
```

### Step 2: Set Up the Connection

```python
# Example values
password = "your_password_here"

# Create a connection object
con = pymysql.connect(
    host="localhost",
    user="root",
    password=password,
    database="employees"
)
```

### Step 3: Query the Database

```python
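# Execute SQL query and store result in DataFrame.
# Note: pandas officially supports SQLAlchemy connectables and sqlite3
# connections; a raw PyMySQL connection works but emits a UserWarning.
query = "SELECT * FROM employees"
employees = pd.read_sql(query, con)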
```

### Step 4: View the Data

```python
# View first few rows
print(employees.head())

# Get number of records
print(len(employees))  # e.g., 300024 records
```

### Step 5: Close the Connection

```python
con.close()
```
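
The same read pattern can be tried without a MySQL server. The sketch below swaps in Python's built-in `sqlite3` module as a stand-in database, so it is not the MySQL setup from the steps above, and the table contents are invented. It also shows two habits that carry over to `pymysql`: passing values through `params` instead of formatting them into the SQL string, and using `contextlib.closing` so the connection is closed even if the query fails.

```python
import sqlite3
from contextlib import closing

import pandas as pd

with closing(sqlite3.connect(":memory:")) as con:
    # Invented sample table; a real database would already contain the data.
    con.execute("CREATE TABLE employees (emp_no INTEGER, first_name TEXT, gender TEXT)")
    con.executemany(
        "INSERT INTO employees VALUES (?, ?, ?)",
        [(1, "Ana", "F"), (2, "Ben", "M"), (3, "Chen", "M")],
    )
    con.commit()

    # sqlite3 uses ? placeholders; pymysql uses %s instead.
    query = "SELECT emp_no, first_name FROM employees WHERE gender = ?"
    employees = pd.read_sql(query, con, params=("M",))

print(employees)
print(len(employees))
```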

---

## Summary

| Task                     | Tool/Method                         |
|--------------------------|-------------------------------------|
| Connect to MySQL         | `pymysql.connect()`                |
| Run SQL query            | `pd.read_sql(query, connection)`   |
| Close connection         | `connection.close()`               |
| Required library         | `pymysql`                          |

---

## Notes

- This is just a demonstration using a **local MySQL server** and a sample `employees` database.
- For production use, make sure to:
  - Secure your credentials
  - Handle exceptions
  - Consider connection pooling
- Use documentation and tutorials specific to your database system and environment.
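
One common way to keep credentials out of the notebook is to read them from environment variables. The helper below is a hypothetical sketch, not part of `pymysql`; the variable names (`MYSQL_USER`, `MYSQL_PASSWORD`, and so on) are assumptions and should follow whatever convention your environment already uses.

```python
import os


def mysql_settings_from_env(environ=os.environ):
    """Build connection keyword arguments from environment variables.

    Hypothetical helper: the variable names are illustrative only.
    """
    try:
        return {
            "host": environ.get("MYSQL_HOST", "localhost"),
            "user": environ["MYSQL_USER"],
            "password": environ["MYSQL_PASSWORD"],
            "database": environ.get("MYSQL_DATABASE", "employees"),
        }
    except KeyError as missing:
        # Fail early with a clear message instead of connecting without credentials.
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# Demonstration with an explicit mapping instead of the real environment.
settings = mysql_settings_from_env(
    {"MYSQL_USER": "analyst", "MYSQL_PASSWORD": "not-a-real-password"}
)
print({key: value for key, value in settings.items() if key != "password"})
```

The resulting dictionary uses the same keyword names as the connection call in Step 2, so it can be unpacked as `pymysql.connect(**settings)`.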

---

## Useful References

- [Pandas `read_sql`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_sql.html)
- [PyMySQL Documentation](https://pymysql.readthedocs.io/)
- [MySQL Sample Employees Database](https://dev.mysql.com/doc/employee/en/)

In [43]:
import pandas as pd
#import pymysql

In [42]:
!pip install pymysql

Collecting pymysql
  Downloading PyMySQL-1.1.1-py3-none-any.whl.metadata (4.4 kB)
Downloading PyMySQL-1.1.1-py3-none-any.whl (44 kB)
Installing collected packages: pymysql
Successfully installed pymysql-1.1.1


In [44]:
import pymysql

In [46]:
#con = pymysql.connect(host='localhost',user='root',password='',database='hello')

In [48]:
#hello = pd.read_sql('SELECT * FROM hello',con)

In [49]:
#hello.head()

In [50]:
#len(hello)

In [51]:
#con.close()

# Pandas Input and Output Methods (I/O)

---

## Overview

Pandas provides a wide range of **input and output (I/O) methods** for reading from and writing to various file formats. So far, we've covered:

- `pd.read_csv()` – Read CSV files
- `pd.read_excel()` – Read Excel files
- `pd.read_html()` – Read HTML tables from web pages
- `pd.read_sql()` – Read SQL query or database table into a DataFrame
- `df.to_csv()` – Export DataFrame to CSV
- `df.to_excel()` – Export DataFrame to Excel

---

## Pandas I/O Documentation

There is an **official I/O documentation page** provided by Pandas which lists all supported formats. It includes I/O support for:

- Text files: CSV, JSON, HTML, XML
- Binary files: Excel, HDF5, Parquet, Feather
- SQL databases
- Google BigQuery
- Clipboard
- Pickle serialization

You can view it here:
- [Pandas I/O Tools Documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html)

---

## Why You Should Use It

- You'll encounter various data formats in real-world projects.
- This page gives **function-specific arguments** like `delimiter`, `sheet_name`, `usecols`, etc.
- Saves time by avoiding trial-and-error or incomplete usage.
- Essential reference when working with less common formats (e.g. Stata, SAS, SPSS, ORC).

---

## Summary

| File Type        | Read Method         | Write Method        |
|------------------|---------------------|---------------------|
| CSV              | `read_csv()`        | `to_csv()`          |
| Excel            | `read_excel()`      | `to_excel()`        |
| HTML             | `read_html()`       | `to_html()`         |
| SQL              | `read_sql()`        | `to_sql()`          |
| JSON             | `read_json()`       | `to_json()`         |
| Parquet          | `read_parquet()`    | `to_parquet()`      |
| Clipboard        | `read_clipboard()`  | `to_clipboard()`    |
| Pickle           | `read_pickle()`     | `to_pickle()`       |

