

# 📘 Data Collection in Python

Data collection is the process of gathering data from various sources to perform analysis, build models, or generate insights.  
Python provides several powerful libraries, such as **pandas**, that make it easy to read and work with data from different file formats.

In this notebook, we'll explore how to **read data from CSV, Excel, and Text files** using pandas.



## 🧾 Reading CSV Files with `read_csv()`

The `pandas.read_csv()` function is used to load data from a **Comma-Separated Values (CSV)** file into a **DataFrame**.

A CSV file is a plain text file that stores tabular data in a simple format where values are separated by commas.

**Common Syntax:**
```python
import pandas as pd

df = pd.read_csv("filename.csv")
```

**Common Parameters:**
- `sep`: Specifies the delimiter (default is comma `,`).
- `header`: Row number to use as column names (default is `"infer"`, which uses row 0 when `names` is not given).
- `index_col`: Column to use as the row labels.
- `usecols`: Select specific columns to read.
- `nrows`: Read only a specific number of rows.
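
The parameters above can be combined in a single call. The following is a minimal sketch, assuming a small semicolon-delimited dataset held in memory; the column names and values are hypothetical and only stand in for a real file on disk.

```python
import io

import pandas as pd

# Hypothetical semicolon-delimited data standing in for a file on disk.
raw = io.StringIO(
    "id;name;score;city\n"
    "1;Ana;90;Lima\n"
    "2;Ben;85;Oslo\n"
    "3;Cy;78;Rome\n"
)

df = pd.read_csv(
    raw,
    sep=";",                          # values are separated by semicolons
    header=0,                         # first row holds the column names
    index_col="id",                   # use the "id" column as row labels
    usecols=["id", "name", "score"],  # skip the "city" column
    nrows=2,                          # read only the first two data rows
)
print(df)
```

Passing a file path instead of the `StringIO` object works the same way.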

**Example:**
```python
import pandas as pd

# Read CSV file
df = pd.read_csv("data/sample_data.csv")

# Display first 5 rows
df.head()
```



## 📊 Reading Excel Files with `read_excel()`

The `pandas.read_excel()` function is used to load data from **Microsoft Excel files** (`.xlsx` or `.xls` formats).

**Common Syntax:**
```python
import pandas as pd

df = pd.read_excel("filename.xlsx")
```

**Common Parameters:**
- `sheet_name`: Specifies the sheet to read (can be a name, index, or list of sheets).
- `usecols`: Select specific columns to import.
- `nrows`: Limit the number of rows read.
- `skiprows`: Skip rows at the beginning of the file (an integer count) or specific rows (a list of 0-based row numbers).
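
As a rough illustration of these parameters, the sketch below builds a small workbook in memory and reads part of it back. It assumes the `openpyxl` package is installed (pandas relies on it for `.xlsx` files); the sheet contents, column names, and the title row are made up for the example.

```python
import io

import pandas as pd
from openpyxl import Workbook

# Build a hypothetical two-sheet workbook in memory.
wb = Workbook()
ws = wb.active
ws.title = "Q1 Sales"
ws.append(["Quarterly report"])                      # title row above the header
ws.append(["region", "product", "units", "revenue"])
ws.append(["West", "Chairs", 3, 731.94])
ws.append(["South", "Tables", 5, 957.58])
ws.append(["East", "Labels", 2, 14.62])

returns_ws = wb.create_sheet("Returns")
returns_ws.append(["Returned", "Order ID"])
returns_ws.append(["Yes", "A-001"])

buffer = io.BytesIO()
wb.save(buffer)

# Read one sheet: skip the title row, keep two columns, stop after two rows.
buffer.seek(0)
df = pd.read_excel(
    buffer,
    sheet_name="Q1 Sales",
    skiprows=1,
    usecols=["region", "revenue"],
    nrows=2,
)
print(df)

# A list of sheet names returns a dict of DataFrames keyed by sheet name.
buffer.seek(0)
sheets = pd.read_excel(buffer, sheet_name=["Q1 Sales", "Returns"])
print(list(sheets))
```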

**Example:**
```python
import pandas as pd

# Read Excel file
df = pd.read_excel("data/sales_data.xlsx", sheet_name="Q1 Sales")

# Display basic info
df.info()
```



## 📄 Reading Text Files with `read_table()` or `read_csv()`

Text files may not always use commas as delimiters. They might use tabs (`\t`), spaces, or other characters.

You can use either `pd.read_table()` or `pd.read_csv()` with a custom separator.

**Example using `read_table()`:**
```python
import pandas as pd

df = pd.read_table("data/custom_data.txt", sep="\t")
df.head()
```

**Example using `read_csv()` with custom delimiter:**
```python
import pandas as pd

# A single-space separator assumes exactly one space between values
df = pd.read_csv("data/space_separated.txt", sep=" ")
df.head()
```
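
When values are padded with a varying number of spaces, a single-space separator produces empty fields. One common alternative, sketched here with hypothetical in-memory data, is the regular expression `r"\s+"`, which treats any run of whitespace as one delimiter. The same sketch also shows that `read_table()` splits on tabs without an explicit `sep`.

```python
import io

import pandas as pd

# Tab-separated text: read_table() uses "\t" as its default separator.
tabbed = io.StringIO("name\tage\nAna\t31\nBen\t27\n")
df_tab = pd.read_table(tabbed)

# Columns aligned with a varying number of spaces.
spaced = io.StringIO("name   age\nAna    31\nBen  27\n")
df_space = pd.read_csv(spaced, sep=r"\s+")

print(df_tab)
print(df_space)
```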

**Key Points:**
- `sep` defines how values are separated.
- You can read many text formats by adjusting the separator.
- pandas assumes UTF-8 by default. If you encounter a `UnicodeDecodeError`, pass the file's actual encoding, for example `encoding='latin1'`.
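
The effect of the `encoding` parameter can be seen with a small experiment. This is only a sketch: the bytes below are hypothetical data encoded as Latin-1, which the default UTF-8 decoding is expected to reject.

```python
import io

import pandas as pd

# Hypothetical data containing accented characters, stored as Latin-1 bytes.
raw = "name,city\nZoë,Málaga\n".encode("latin1")

# Reading with the default encoding (utf-8) should fail on these bytes.
failed_with_default = False
try:
    pd.read_csv(io.BytesIO(raw))
except UnicodeDecodeError:
    failed_with_default = True

# Naming the actual encoding lets pandas decode the text correctly.
df = pd.read_csv(io.BytesIO(raw), encoding="latin1")
print(failed_with_default)
print(df)
```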



## ✅ Summary

| Function | File Type | Description |
|-----------|------------|-------------|
| `read_csv()` | `.csv` | Reads comma-separated values files |
| `read_excel()` | `.xls`, `.xlsx` | Reads Excel spreadsheet files |
| `read_table()` | `.txt` | Reads delimited text files (tab-separated by default) |

Each of these methods loads data into a **pandas DataFrame**, enabling you to manipulate, analyze, and visualize the data effectively.


In [None]:
# Read a CSV file from local storage
import pandas as pd

# Pass an explicit encoding when the default (utf-8) cannot decode the file.
# Two encodings worth trying:
# a) utf-8
# b) latin1
superstore_data1 = pd.read_csv("/content/Sample - Superstore.csv", encoding='latin1')
superstore_data1.head()

|  | Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | ... | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CA-2016-152156 | 11/8/2016 | 11/11/2016 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420 | South | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.96 | 2 | 0.0 | 41.9136 |
| 1 | 2 | CA-2016-152156 | 11/8/2016 | 11/11/2016 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420 | South | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.94 | 3 | 0.0 | 219.582 |
| 2 | 3 | CA-2016-138688 | 6/12/2016 | 6/16/2016 | Second Class | DV-13045 | Darrin Van Huff | Corporate | United States | Los Angeles | ... | 90036 | West | OFF-LA-10000240 | Office Supplies | Labels | Self-Adhesive Address Labels for Typewriters b... | 14.62 | 2 | 0.0 | 6.8714 |
| 3 | 4 | US-2015-108966 | 10/11/2015 | 10/18/2015 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | ... | 33311 | South | FUR-TA-10000577 | Furniture | Tables | Bretford CR4500 Series Slim Rectangular Table | 957.5775 | 5 | 0.45 | -383.031 |
| 4 | 5 | US-2015-108966 | 10/11/2015 | 10/18/2015 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | ... | 33311 | South | OFF-ST-10000760 | Office Supplies | Storage | Eldon Fold 'N Roll Cart System | 22.368 | 2 | 0.2 | 2.5164 |


In [None]:
# Read an Excel file from local storage
import pandas as pd

superstore_data2 = pd.read_excel("/content/Superstore.xlsx", sheet_name="Returns")
superstore_data2.head()

|  | Returned | Order ID |
|---|---|---|
| 0 | Yes | CA-2017-153822 |
| 1 | Yes | CA-2017-129707 |
| 2 | Yes | CA-2014-152345 |
| 3 | Yes | CA-2015-156440 |
| 4 | Yes | US-2017-155999 |


In [None]:
# Read datasets from relational databases (SQL)
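
The cell above is left empty in the notebook. One possible way to fill it, shown only as a sketch, is `pd.read_sql_query()` with Python's built-in `sqlite3` module. The in-memory database and the `returns` table are assumptions made for illustration; a real project would connect to its own database instead.

```python
import sqlite3

import pandas as pd

# Hypothetical in-memory SQLite database standing in for a real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE returns (order_id TEXT, returned TEXT)")
conn.executemany(
    "INSERT INTO returns VALUES (?, ?)",
    [("CA-2017-153822", "Yes"), ("CA-2017-129707", "Yes")],
)
conn.commit()

# Run a SQL query and load the result set into a DataFrame.
returns_df = pd.read_sql_query("SELECT order_id, returned FROM returns", conn)
conn.close()

print(returns_df.head())
```

For other database engines, pandas documents the use of a SQLAlchemy connection in place of the `sqlite3` connection.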