## What is Pandas?

**Pandas** is an open-source data analysis and manipulation library for Python, built on top of NumPy. Data scientists, analysts, and developers use it widely to work with structured (tabular) data.

### Why Use Pandas?

Pandas simplifies analyzing, cleaning, and visualizing data. It provides flexible tools to:

* Load and save data in various file formats (such as CSV, Excel, and JSON)
* Clean and preprocess data
* Handle missing data
* Filter, sort, and group data
* Work with time series data
* Merge and join datasets
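
As a rough illustration of a few of these tasks, the sketch below fills a missing value, filters, sorts, and groups a small table. The `sales` data and its column names are hypothetical, invented for this example only.

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only
sales = pd.DataFrame({
    "city": ["Oslo", "Oslo", "Bergen", "Bergen"],
    "units": [10, None, 7, 4],
})

# Handle missing data: replace the missing unit count with 0
sales["units"] = sales["units"].fillna(0)

# Filter rows with at least 5 units, then sort them from largest to smallest
big_orders = sales[sales["units"] >= 5].sort_values("units", ascending=False)

# Group by city and total the units
units_per_city = sales.groupby("city")["units"].sum()

print(big_orders)
print(units_per_city)
```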

### Core Data Structures

Pandas has two primary data structures:

* **Series**: A one-dimensional labeled array, similar to a single column in a table.
* **DataFrame**: A two-dimensional, tabular data structure with labeled rows and columns, similar to a spreadsheet or SQL table.

These structures make it easy to explore, clean, and transform data.
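
A minimal sketch of both structures, using made-up weather numbers (the names `temps` and `weather` are assumptions for illustration), might look like this:

```python
import pandas as pd

# A Series: one-dimensional values with an index of labels
# (the numbers and labels here are made up for illustration)
temps = pd.Series([21.5, 19.0, 23.1], index=["mon", "tue", "wed"], name="temp_c")

# A DataFrame: a table whose columns share the same row index
weather = pd.DataFrame(
    {"temp_c": [21.5, 19.0, 23.1], "rain_mm": [0.0, 4.2, 0.0]},
    index=["mon", "tue", "wed"],
)

# Selecting a single column of a DataFrame returns a Series
column = weather["temp_c"]
print(type(column).__name__)   # Series
print(column.equals(temps))    # True
```

Selecting one column gives back a Series, which is why a DataFrame is often described as a collection of Series that share an index.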

### Common Use Cases

* Analyzing survey data or reports
* Cleaning messy datasets
* Performing statistical analysis
* Building dashboards and visual reports
* Preparing data for machine learning

### ✅ Benefits of Pandas

* Easy-to-use syntax for beginners and professionals
* Fast, vectorized operations on datasets that fit in memory
* Seamless integration with other libraries like NumPy, Matplotlib, and Scikit-learn
* Well-documented and supported by a strong community

### 📚 Learn More

You can explore more through the [official Pandas documentation](https://pandas.pydata.org/docs/).

## Exercise: Inspecting a DataFrame

When working with a new DataFrame, your first task is to **understand its structure and contents**. Here are some helpful methods and attributes to get started:

- `.head()` – Displays the first few rows of the DataFrame.
- `.info()` – Prints a summary of the columns, including data types and non-null counts (from which the number of missing values can be inferred).
- `.shape` – Returns a tuple with the number of rows and columns.
- `.describe()` – Provides summary statistics for each numeric column.
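
Because `.info()` reports non-null counts rather than missing counts, it can help to count missing values directly. The sketch below does this on a hypothetical three-row table (not the `homelessness` data) with `isna().sum()`:

```python
import pandas as pd

# Hypothetical three-row table with one missing value
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [1.5, None, 3.0]})

df.info()                    # prints the summary itself and returns None
missing = df.isna().sum()    # number of missing values in each column
print(missing)
print(df.shape)              # (3, 2)
```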

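The DataFrame `homelessness` contains **estimates of homelessness across U.S. states in 2018**:

- `region`: Region of the U.S. that the state belongs to.
- `state`: Name of the state (the District of Columbia is included, giving 51 rows).
- `individuals`: Number of homeless people not in families.
- `family_members`: Number of homeless people in families with children.
- `state_pop`: Total population of each state.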

### Instructions:

1. **View the First Rows**  
   Use the `.head()` method to look at the first few entries in the dataset. This helps you get an initial sense of the data.

2. **Check Column Details**  
   Use `.info()` to examine each column’s data type and non-null count, which shows whether any values are missing.

3. **Check Dimensions**  
   Use the `.shape` attribute to find out how many rows and columns the dataset has.

4. **View Summary Statistics**  
   Use `.describe()` to see basic statistical summaries (like mean, min, max) for each numeric column.

In [1]:
import pandas as pd

data = {
    "region": [
        "East South Central", "Pacific", "Mountain", "West South Central", "Pacific", "Mountain", "New England",
        "South Atlantic", "South Atlantic", "South Atlantic", "South Atlantic", "Pacific", "Mountain",
        "East North Central", "East North Central", "West North Central", "West North Central", "East South Central",
        "West South Central", "New England", "South Atlantic", "New England", "East North Central",
        "West North Central", "East South Central", "West North Central", "Mountain", "West North Central",
        "Mountain", "New England", "Mid-Atlantic", "Mountain", "Mid-Atlantic", "South Atlantic",
        "West North Central", "East North Central", "West South Central", "Pacific", "Mid-Atlantic",
        "New England", "South Atlantic", "West North Central", "East South Central", "West South Central",
        "Mountain", "New England", "South Atlantic", "Pacific", "South Atlantic", "East North Central", "Mountain"
    ],
    "state": [
        "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut",
        "Delaware", "District of Columbia", "Florida", "Georgia", "Hawaii", "Idaho",
        "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky",
        "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan",
        "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska",
        "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York",
        "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
        "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee",
        "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"
    ],
    "individuals": [
        2570.0, 1434.0, 7259.0, 2280.0, 109008.0, 7607.0, 2280.0,
        708.0, 3770.0, 21443.0, 6943.0, 4131.0, 1297.0,
        6752.0, 3776.0, 1711.0, 1443.0, 2735.0,
        2540.0, 1450.0, 4914.0, 6811.0, 5209.0,
        3993.0, 1024.0, 3776.0, 983.0, 1745.0,
        7058.0, 835.0, 6048.0, 1949.0, 39827.0,
        6451.0, 467.0, 6929.0, 2823.0, 11139.0,
        8163.0, 747.0, 3082.0, 836.0, 6139.0,
        19199.0, 1904.0, 780.0, 3928.0, 16424.0, 1021.0, 2740.0, 434.0
    ],
    "family_members": [
        864.0, 582.0, 2606.0, 432.0, 20964.0, 3250.0, 1696.0,
        374.0, 3134.0, 9587.0, 2556.0, 2399.0, 715.0,
        3891.0, 1482.0, 1038.0, 773.0, 953.0,
        519.0, 1066.0, 2230.0, 13257.0, 3142.0,
        3250.0, 328.0, 2107.0, 422.0, 676.0,
        486.0, 615.0, 3350.0, 602.0, 52070.0,
        2817.0, 75.0, 3320.0, 1048.0, 3337.0,
        5349.0, 354.0, 851.0, 323.0, 1744.0,
        6111.0, 972.0, 511.0, 2047.0, 5880.0, 222.0, 2167.0, 205.0
    ],
    "state_pop": [
        4887681, 735139, 7158024, 3009733, 39461588, 5691287, 3571520,
        965479, 701547, 21244317, 10511131, 1420593, 1750536,
        12723071, 6695497, 3148618, 2911359, 4461153,
        4659690, 1339057, 6035802, 6882635, 9984072,
        5606249, 2981020, 6121623, 1060665, 1925614,
        3027341, 1353465, 8886025, 2092741, 19530351,
        10381615, 758080, 11676341, 3940235, 4181886,
        12800922, 1058287, 5084156, 878698, 6771631,
        28628666, 3153550, 624358, 8501286, 7523869, 1804291, 5807406, 577601
    ]
}

homelessness = pd.DataFrame(data)

In [9]:
# Print the head of the homelessness data
print(homelessness.head())

               region       state  individuals  family_members  state_pop
0  East South Central     Alabama       2570.0           864.0    4887681
1             Pacific      Alaska       1434.0           582.0     735139
2            Mountain     Arizona       7259.0          2606.0    7158024
3  West South Central    Arkansas       2280.0           432.0    3009733
4             Pacific  California     109008.0         20964.0   39461588


In [10]:
# Print information about homelessness
# (.info() prints its summary directly and returns None, so no print() is needed)
homelessness.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 5 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   region          51 non-null     object 
 1   state           51 non-null     object 
 2   individuals     51 non-null     float64
 3   family_members  51 non-null     float64
 4   state_pop       51 non-null     int64  
dtypes: float64(2), int64(1), object(2)
memory usage: 2.1+ KB


In [11]:
# Print the shape of homelessness
print(homelessness.shape)

(51, 5)


In [12]:
# Print a description of homelessness
print(homelessness.describe())

         individuals  family_members     state_pop
count      51.000000       51.000000  5.100000e+01
mean     7225.784314     3504.882353  6.405637e+06
std     15991.025083     7805.411811  7.327258e+06
min       434.000000       75.000000  5.776010e+05
25%      1446.500000      592.000000  1.777414e+06
50%      3082.000000     1482.000000  4.461153e+06
75%      6781.500000     3196.000000  7.340946e+06
max    109008.000000    52070.000000  3.946159e+07
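
By default `.describe()` leaves out text columns such as `region` and `state`. As a hedged sketch on a hypothetical three-row table (the values are borrowed from the data above purely for illustration), passing `include="all"` adds `unique`, `top`, and `freq` rows for the text columns:

```python
import pandas as pd

# Hypothetical toy table mixing a text column and a numeric column
df = pd.DataFrame({
    "region": ["Pacific", "Pacific", "Mountain"],
    "individuals": [1434.0, 4131.0, 7259.0],
})

numeric_summary = df.describe()             # numeric columns only
all_summary = df.describe(include="all")    # adds unique/top/freq for text columns

print(numeric_summary.columns.tolist())     # ['individuals']
print(all_summary.loc["top", "region"])     # Pacific
```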


## Exercise: Parts of a DataFrame

To work effectively with DataFrames in pandas, it's important to understand their internal structure. A DataFrame is composed of three main components, accessible as attributes:

- `.values`: A two-dimensional NumPy array that contains the actual data.
- `.columns`: An index object that holds the names of the columns.
- `.index`: An index object that identifies the rows, which can be numeric or labeled.

You can typically think of these indexes as lists of strings or numbers, although the `Index` object in pandas supports more advanced features (which you'll learn later in the course).
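
The sketch below shows these components on a hypothetical two-row table. It uses `.to_numpy()`, which the pandas documentation recommends over `.values`; both return a NumPy array. Note that a table mixing text and numbers produces an array of dtype `object`, since every element of a NumPy array shares one dtype.

```python
import pandas as pd

# Hypothetical two-row table, invented for illustration
df = pd.DataFrame({"state": ["Alabama", "Alaska"], "individuals": [2570.0, 1434.0]})

# Mixed text and numbers must share one array dtype, so it falls back to object
mixed = df.to_numpy()
print(mixed.dtype)                      # object

# Selecting only numeric columns keeps a numeric dtype
numeric = df[["individuals"]].to_numpy()
print(numeric.dtype)                    # float64

# Index objects convert to plain Python lists when needed
print(df.columns.tolist())              # ['state', 'individuals']
print(df.index.tolist())                # [0, 1]
```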

### Instructions:

1. Display the raw 2D NumPy array containing the data from `homelessness`.
2. Display the column names of the `homelessness` DataFrame.
3. Display the index (row labels) of the `homelessness` DataFrame.


In [14]:
# Print the values of homelessness
print(homelessness.values)

[['East South Central' 'Alabama' 2570.0 864.0 4887681]
 ['Pacific' 'Alaska' 1434.0 582.0 735139]
 ['Mountain' 'Arizona' 7259.0 2606.0 7158024]
 ['West South Central' 'Arkansas' 2280.0 432.0 3009733]
 ['Pacific' 'California' 109008.0 20964.0 39461588]
 ['Mountain' 'Colorado' 7607.0 3250.0 5691287]
 ['New England' 'Connecticut' 2280.0 1696.0 3571520]
 ['South Atlantic' 'Delaware' 708.0 374.0 965479]
 ['South Atlantic' 'District of Columbia' 3770.0 3134.0 701547]
 ['South Atlantic' 'Florida' 21443.0 9587.0 21244317]
 ['South Atlantic' 'Georgia' 6943.0 2556.0 10511131]
 ['Pacific' 'Hawaii' 4131.0 2399.0 1420593]
 ['Mountain' 'Idaho' 1297.0 715.0 1750536]
 ['East North Central' 'Illinois' 6752.0 3891.0 12723071]
 ['East North Central' 'Indiana' 3776.0 1482.0 6695497]
 ['West North Central' 'Iowa' 1711.0 1038.0 3148618]
 ['West North Central' 'Kansas' 1443.0 773.0 2911359]
 ['East South Central' 'Kentucky' 2735.0 953.0 4461153]
 ['West South Central' 'Louisiana' 2540.0 519.0 4659690]
 ['New 

In [15]:
# Print the column index of homelessness
print(homelessness.columns)

Index(['region', 'state', 'individuals', 'family_members', 'state_pop'], dtype='object')


In [38]:
# Print the row index of homelessness
print(homelessness.index)

RangeIndex(start=0, stop=51, step=1)
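
The default `RangeIndex` simply numbers the rows. To illustrate a labeled index instead, the sketch below applies `set_index` to a hypothetical three-row subset of the table so that rows can be looked up by state name:

```python
import pandas as pd

# Hypothetical three-row subset of the table, for illustration only
df = pd.DataFrame({
    "state": ["Alabama", "Alaska", "Arizona"],
    "individuals": [2570.0, 1434.0, 7259.0],
})

print(df.index)                      # default RangeIndex: 0, 1, 2
by_state = df.set_index("state")     # returns a new DataFrame; df is unchanged

# Rows can now be looked up by label with .loc
print(by_state.loc["Alaska", "individuals"])   # 1434.0
print(by_state.index.tolist())                 # ['Alabama', 'Alaska', 'Arizona']
```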
