# Week 7 Instructor-Led Lab: DataFrame Exploration  
**Author:** Thomas J. Greenberg  
**Course:** BGEN632 – Graduate Introduction to Python  
**Term:** Spring 2025  
**Date:** April 14, 2025  

## Purpose and Methods  

 This lab covers pandas data manipulation techniques using the **Loblolly.csv** dataset, 
 including indexing, slicing, sorting, transformation, metadata exploration, K-Fold splits, 
 and datetime handling.

The lab uses the following methods:
 - **OS module** for file navigation.
 - **pd.read_csv()** for data loading.
 - **.iloc[]** and **.loc[]** for indexing.
 - Slicing to retrieve specific rows and columns.
 - **pd.unique()** to identify distinct values.
 - Metadata checks with **.shape**, **.dtypes**, and **.info()**.
 - Handling missing data with **isnull()** and **notnull()**.
 - Combining data with **pd.concat()**.
 - Sorting using **sort_values()** and sampling with **sample()**.
 - Converting data types with **.astype()**.
 - Date/time handling using **pd.to_datetime()**.
 - Implementing train/test splits with **KFold**.

Before reading any dataset, check the working directory and set it to the lab materials folder.

In [5]:
!pip install scikit-learn






In [6]:
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold

## Set Working Directory

In [8]:
os.getcwd()

'C:\\MySystem\\School\\Python\\GitHubStuff\\week7labs'

In [9]:
os.chdir("C:/MySystem/School/Python/GitHubStuff/week7labs/data")
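As an alternative to changing the working directory, the file path can be built explicitly. The sketch below assumes a hypothetical `data` folder under the current working directory; it only checks whether the file is there and does not read it.

```python
from pathlib import Path

# Assumption: a "data" folder sits directly under the current working directory.
data_dir = Path.cwd() / "data"
csv_path = data_dir / "Loblolly.csv"

# Confirm the file is reachable before calling pd.read_csv(csv_path).
print(csv_path.name, "exists:", csv_path.exists())
```

Passing a full path to `pd.read_csv()` keeps the notebook independent of whichever folder it was launched from.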

## Load Dataset: Loblolly.csv

In [10]:
loblolly = pd.read_csv("Loblolly.csv")
loblolly.head()

Unnamed: 0,height,age,Seed
0,4.51,3,301
1,4.55,3,303
2,4.79,3,305
3,3.91,3,307
4,4.81,3,309


## Preview the DataFrame  

After reading in the dataset, you can quickly check its contents using three common options:

- `df.head()` – displays the first five rows (default)  
- `df.tail()` – displays the last five rows (default)  
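- Typing the DataFrame name alone (e.g., `loblolly`) displays the first and last five rows when the frame has more rows than the `display.max_rows` option allows (60 by default); shorter frames are shown in full

These quick checks help verify the load before moving on to indexing or cleaning.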

In [26]:
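loblolly           # whole frame, shown in truncated form
loblolly.head(10)  # first 10 rows
loblolly.tail(10)  # last 10 rows; a cell displays only its final expression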

Unnamed: 0,height,age,Seed
74,63.05,25,309
75,59.64,25,311
76,60.07,25,315
77,60.69,25,319
78,60.28,25,321
79,61.62,25,323
80,58.49,25,325
81,56.81,25,327
82,56.43,25,329
83,59.49,25,331
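The preview options can also be tried without the lab files. This is a minimal sketch on a hypothetical 84-row frame (the same length as the Loblolly data, but with made-up values); `first_five` and `last_three` are illustrative names.

```python
import pandas as pd

# Hypothetical stand-in: 84 rows of placeholder values, not the real Loblolly data.
demo = pd.DataFrame({"height": [float(i) for i in range(84)]})

# Frames longer than this option are displayed in truncated form.
print("display.max_rows:", pd.get_option("display.max_rows"))
print("rows in frame:", len(demo))

first_five = demo.head()   # default n=5
last_three = demo.tail(3)  # explicit n
print(first_five)
print(last_three)
```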


## Inspect DataFrame Metadata  
Check **.shape** for dimensions, **.columns** for variable names,  
and **.info()** for data types and non-null counts (which reveal missing values).

In [11]:
print("Shape:", loblolly.shape)
print("\nColumns:", loblolly.columns.tolist())
loblolly.info()

Shape: (84, 3)

Columns: ['height', 'age', 'Seed']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84 entries, 0 to 83
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   height  84 non-null     float64
 1   age     84 non-null     int64  
 2   Seed    84 non-null     int64  
dtypes: float64(1), int64(2)
memory usage: 2.1 KB


## Use `.iloc[]` for Index-Based Access  
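Use **.iloc[]** to select rows and columns by integer position. It accepts half-open slices `[start:stop]` (stop excluded),  
row lists such as `[[0, 4, 7]]`, and column lists such as `[:, [0, 2]]`. Positions must exist: this frame has three columns, so valid column positions are 0 to 2.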

In [24]:
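loblolly.iloc[0:3]                # rows at positions 0, 1, 2
loblolly.iloc[[0, 1, 2], [1, 2]]  # the same rows, columns at positions 1 and 2
loblolly.iloc[:, 1:3]             # all rows, columns age and Seed (displayed)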

Unnamed: 0,age,Seed
0,3,301
1,3,303
2,3,305
3,3,307
4,3,309
...,...,...
79,25,323
80,25,325
81,25,327
82,25,329


## Use `.loc[]` for Label-Based Access
`.loc[]` targets rows and columns by label. For example, label slicing `['A':'D']` (inclusive),  
selecting columns with `[:, ['height','age']]`, or boolean filtering `df.loc[df['age'] > 10]`.

In [12]:
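loblolly.loc[:, 'age']                    # one column, returned as a Series
loblolly.loc[0:3, 'height':'age']         # labels 0 through 3 inclusive (4 rows)
loblolly.loc[[0, 1, 2], ['age', 'Seed']]  # listed rows and columns (displayed)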

Unnamed: 0,age,Seed
0,3,301
1,3,303
2,3,305


## Detecting Missing Values
Understanding the structure of your DataFrame is essential before analysis. This section checks:

- Unique values in columns (e.g., `age`)
- Data types with `.dtypes`
- Missing data using `pd.isnull()` and `pd.notnull()`

In [13]:
print("Unique ages:", pd.unique(loblolly['age']))

print("\nData types:\n", loblolly.dtypes)

print("\nMissing values:\n", pd.isnull(loblolly).sum())

print("\nComplete data flags:\n", pd.notnull(loblolly).all())

Unique ages: [ 3  5 10 15 20 25]

Data types:
 height    float64
age         int64
Seed        int64
dtype: object

Missing values:
 height    0
age       0
Seed      0
dtype: int64

Complete data flags:
 height    True
age       True
Seed      True
dtype: bool
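Loblolly has no missing values, so the checks above all come back clean. The following sketch uses a hypothetical frame with two deliberately inserted `NaN` values to show what the same calls report when data is missing.

```python
import numpy as np
import pandas as pd

# Hypothetical data: NaN values are inserted on purpose for illustration.
demo = pd.DataFrame({"height": [4.51, np.nan, 4.79], "age": [3, 3, np.nan]})

missing_per_column = demo.isnull().sum()            # count of NaN per column
complete_rows = demo[demo.notnull().all(axis=1)]    # keep rows with no NaN at all

print(missing_per_column)
print(complete_rows)
```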


## Add New Rows to DataFrame

In [14]:
new_rows = pd.DataFrame({
    'height': [71.22, 85.05, 68.34],
    'age': [30, 30, 30],
    'Seed': [400, 401, 402]
})
loblolly_mod_1 = pd.concat([loblolly, new_rows], ignore_index=True)
loblolly_mod_1.tail()

Unnamed: 0,height,age,Seed
82,56.43,25,329
83,59.49,25,331
84,71.22,30,400
85,85.05,30,401
86,68.34,30,402
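The effect of `ignore_index=True` is easiest to see by comparing both settings. This is a minimal sketch on a two-row frame plus one new row, with values borrowed from the output above; `kept` and `renumbered` are illustrative names.

```python
import pandas as pd

base = pd.DataFrame({"height": [56.43, 59.49], "age": [25, 25]})
extra = pd.DataFrame({"height": [71.22], "age": [30]})

kept = pd.concat([base, extra])                           # original labels are kept
renumbered = pd.concat([base, extra], ignore_index=True)  # labels rebuilt as 0..n-1

print(kept.index.tolist())
print(renumbered.index.tolist())
```

Without `ignore_index=True`, the duplicate label `0` would make later `.loc[0]` lookups return two rows.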


## Add New Column to DataFrame

In [15]:
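# Random integers from 1 to 3 (the upper bound 4 is excluded); unseeded, so values change on each run
new_cols = pd.DataFrame({
    'diameter': np.random.randint(1, 4, size=len(loblolly_mod_1))
})
# axis=1 places the new column beside the existing ones, aligned on the index
loblolly_mod_2 = pd.concat([loblolly_mod_1, new_cols], axis=1)
loblolly_mod_2.head()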

Unnamed: 0,height,age,Seed,diameter
0,4.51,3,301,1
1,4.55,3,303,2
2,4.79,3,305,3
3,3.91,3,307,3
4,4.81,3,309,1
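For a single column, direct assignment is a common alternative to `pd.concat(..., axis=1)`, and a seeded generator makes the random values repeatable. This sketch swaps in NumPy's `default_rng` in place of `np.random.randint`; the seed value 42 is an arbitrary assumption.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"height": [4.51, 4.55, 4.79]})

# Seeded generator: the same seed reproduces the same draws on every run.
rng = np.random.default_rng(seed=42)

# integers(1, 4) draws from 1, 2, 3; the upper bound is excluded.
demo["diameter"] = rng.integers(1, 4, size=len(demo))

print(demo)
```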


## Rename Columns

In [16]:
loblolly_mod_2.rename(columns={'diameter': 'trunk_diameter'}, inplace=True)
loblolly_mod_2.columns

Index(['height', 'age', 'Seed', 'trunk_diameter'], dtype='object')

## Sort DataFrame

In [17]:
loblolly_mod_2.sort_values(by='height').head()    # five shortest trees
loblolly_mod_2.nlargest(6, 'height')              # six tallest trees
loblolly_mod_2.sort_values(by=['height', 'age'])  # height first, age breaks ties (displayed)

Unnamed: 0,height,age,Seed,trunk_diameter
13,3.46,3,331,2
8,3.77,3,321,2
5,3.88,3,311,2
3,3.91,3,307,3
12,3.93,3,329,2
...,...,...,...,...
71,63.39,25,303,3
72,64.10,25,305,3
86,68.34,30,402,2
84,71.22,30,400,2
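Sorting in descending order and `nlargest()` give the same top rows. A short sketch, using four height/age pairs that appear in this lab's outputs:

```python
import pandas as pd

demo = pd.DataFrame({"height": [4.51, 63.39, 10.48, 64.10], "age": [3, 25, 5, 25]})

tallest_first = demo.sort_values(by="height", ascending=False)  # full descending sort
top_two = demo.nlargest(2, "height")                            # only the two tallest rows

print(tallest_first)
print(top_two)
```

Both calls return new frames; the original row order in `demo` is unchanged.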


## Manual Sampling

In [18]:
sample_size = int(len(loblolly_mod_2.index) * 0.1)  # 10% of 87 rows, truncated to 8
sample = loblolly_mod_2.sample(n=sample_size, replace=False)  # without replacement; unseeded
sample.head()

Unnamed: 0,height,age,Seed,trunk_diameter
24,10.48,5,325,1
80,58.49,25,325,1
55,39.15,15,331,3
58,55.82,20,305,2
7,4.57,3,319,1
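The sample above changes on every run. Passing `random_state` makes it repeatable, and `frac` lets pandas work out the row count. The sketch below uses a hypothetical 87-row frame and an arbitrary seed of 0.

```python
import pandas as pd

# Hypothetical frame with the same number of rows as loblolly_mod_2.
demo = pd.DataFrame({"row": range(87)})

n = int(len(demo) * 0.1)                         # int() truncates 8.7 down to 8
by_count = demo.sample(n=n, random_state=0)      # fixed seed: same rows every run
by_frac = demo.sample(frac=0.1, random_state=0)  # pandas picks the count; its rounding may differ from int()

print(len(by_count), len(by_frac))
```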


## Convert to Categorical Type

In [19]:
loblolly_mod_2['Seed'] = loblolly_mod_2['Seed'].astype('category')
loblolly_mod_2.dtypes

height             float64
age                  int64
Seed              category
trunk_diameter       int32
dtype: object
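After conversion, the `.cat` accessor exposes the distinct categories and the integer codes pandas stores for each row. A minimal sketch on four seed IDs from the lab data:

```python
import pandas as pd

seeds = pd.Series([301, 303, 301, 305]).astype("category")

print(seeds.cat.categories.tolist())  # distinct values, sorted
print(seeds.cat.codes.tolist())       # position of each row's value within the categories
```

Storing `Seed` this way signals that the numbers are identifiers, not quantities to be averaged.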

## DateTime Example with Tennis Dataset

In [20]:
tennis = pd.read_csv("tennis_serve_time.csv")
tennis['date'] = pd.to_datetime(tennis['date'])
tennis.dtypes

rownames                int64
server                 object
sec_between             int64
opponent               object
game_score             object
set                     int64
game                   object
date           datetime64[ns]
dtype: object
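Once a column has a datetime dtype, the `.dt` accessor can extract its parts. The dates in this sketch are made up for illustration and are not taken from `tennis_serve_time.csv`; the `format` string is likewise an assumption about how such dates might be written.

```python
import pandas as pd

# Hypothetical date strings, not values from the tennis dataset.
demo = pd.DataFrame({"date": ["2015-01-19", "2015-06-29"]})

# An explicit format avoids ambiguous day/month guessing.
demo["date"] = pd.to_datetime(demo["date"], format="%Y-%m-%d")

demo["year"] = demo["date"].dt.year
demo["weekday"] = demo["date"].dt.day_name()

print(demo.dtypes)
print(demo)
```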

## K-Fold Split Example

In [21]:
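kf = KFold(n_splits=2)  # two folds, no shuffling: rows are split in their existing order
for train_index, test_index in kf.split(loblolly_mod_2):
    print("TRAIN:", train_index, "TEST:", test_index)
    break  # show only the first of the two folds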

TRAIN: [44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67
 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86] TEST: [ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43]
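Because the frame is sorted by age, an unshuffled split puts the youngest trees in one fold and the oldest in the other. A hedged sketch of a shuffled split that visits every fold, on ten hypothetical row positions with an arbitrary seed:

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten placeholder row positions standing in for a DataFrame.
rows = np.arange(10)

# shuffle=True randomizes fold membership; random_state makes it repeatable.
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
seen = []
for train_index, test_index in kf.split(rows):
    fold_sizes.append((len(train_index), len(test_index)))
    seen.extend(test_index.tolist())

print(fold_sizes)
print(sorted(seen))
```

Each row lands in exactly one test fold, so the test sets together cover the whole dataset.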


## References

- [pandas Docs](https://pandas.pydata.org)  
- [scikit-learn Docs](https://scikit-learn.org)  
- ChatGPT (OpenAI) – used for help with cell copying and import issues (see accompanying screenshots)