<a href="https://colab.research.google.com/github/shruti63-code/nhanes-adult/blob/main/matrices.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Working with numpy Matrices (Multidimensional Data)

In [None]:
!pip install numpy
import numpy as np
!pip install pandas
import pandas as pd
import matplotlib.pyplot as plt
import urllib.request
import os



This code snippet is essentially preparing the environment by installing necessary libraries and making them available for use within the code.

**Installing numpy:**
* !pip install numpy: This line uses pip, a package installer for Python, to install the numpy library. numpy is a powerful library for numerical computations in Python, providing support for arrays, matrices, and mathematical functions.
* import numpy as np: This line imports the numpy library and assigns it a shorter alias np for easier use throughout the code. This means whenever you see np in the code, it refers to the numpy library.

**Installing pandas:**

* !pip install pandas: Similar to numpy, this line installs the pandas library using pip. pandas is a library built on top of numpy, providing data structures like DataFrames for efficient data manipulation and analysis.
* import pandas as pd: This line imports the pandas library and assigns it an alias pd, making it easier to reference.
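
As a quick illustration of the two aliases described above, the following minimal sketch builds a small matrix with np and wraps it in a DataFrame with pd. The values and the column names a, b, c are placeholders invented for this example, not part of the NHANES data.

```python
import numpy as np
import pandas as pd

# A small 2x3 matrix built through the np alias; the values are arbitrary.
matrix = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0]])

# The same values wrapped in a DataFrame through the pd alias.
# The column names "a", "b", "c" are placeholders for this example.
table = pd.DataFrame(matrix, columns=["a", "b", "c"])

print(matrix.shape)
print(table.shape)
```

Both objects report the same number of rows and columns, since the DataFrame simply adds labels on top of the array.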
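
**Importing other libraries:**

* import matplotlib.pyplot as plt: This line imports matplotlib's plotting interface and assigns it the alias plt.
* import urllib.request: This line imports the standard-library module for opening URLs.
* import os: This line imports the standard-library module for interacting with the operating system, for example when working with file paths.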

In [None]:
# File URLs
male_data_url = "https://github.com/gagolews/teaching-data/raw/master/marek/nhanes_adult_male_bmx_2020.csv"
female_data_url = "https://github.com/gagolews/teaching-data/raw/master/marek/nhanes_adult_female_bmx_2020.csv"


**In simpler terms:**

*Think of these variables as labels you put on containers. The labels are male_data_url and female_data_url. Inside the containers are the addresses where the actual male and female data files are located on the internet (GitHub).*

In [None]:
# Load data into pandas DataFrames with error handling
try:
    male_data = pd.read_csv(male_data_url, on_bad_lines='skip')
    female_data = pd.read_csv(female_data_url, on_bad_lines='skip')
except pd.errors.ParserError as e:
    print("Error loading data:", e)
    raise


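In essence, this code snippet tries to load data from two CSV files into pandas DataFrames. Because on_bad_lines='skip' is set, any line with more fields than expected is dropped silently instead of raising an error. If pandas still raises a ParserError, the except block prints the error message and re-raises the exception, which stops execution to prevent further issues.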
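
The column names printed further down begin with #, which suggests the files start with descriptive lines marked by that character. One possible way to handle such lines, offered as a sketch and not as this notebook's method, is the comment parameter of pd.read_csv. The sample text below is made up to imitate that layout; its numbers are not real NHANES records.

```python
import io
import pandas as pd

# Made-up sample imitating a CSV that starts with '#' description lines.
# The numeric values are illustrative only.
sample_csv = """# Body measurements (illustrative sample)
#
# Weight (kg)
# Standing Height (cm)
BMXWT,BMXHT
80.0,175.0
65.5,160.2
"""

# Same call style as in the notebook: the first '#' line is used as the header.
naive = pd.read_csv(io.StringIO(sample_csv), on_bad_lines="skip")

# With comment='#', lines that start with '#' are ignored, so the header
# is taken from the first line that is not a comment.
parsed = pd.read_csv(io.StringIO(sample_csv), comment="#")

print(list(naive.columns))
print(list(parsed.columns))
print(parsed.shape)
```

In this sample, the second call yields the columns BMXWT and BMXHT with two data rows, while the first call never sees those names as a header.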

In [None]:
# Print column names to check their availability
print("Columns in Male Data:", male_data.columns)
print("Columns in Female Data:", female_data.columns)



Columns in Male Data: Index(['# Body measurements of males >= 18 years old [cm]'], dtype='object')
Columns in Female Data: Index(['# Body measurements of females >= 18 years old [cm]', ' no missing data.'], dtype='object')


*These lines are designed to display the names of the columns (headers) within the male_data and female_data DataFrames.*

**In simpler terms:**

Each line takes the column names from one DataFrame (male_data or female_data) and prints them on the screen with a descriptive label.

The main reason for including these lines is to verify that the data has been loaded correctly and that the expected columns are present in both datasets. By printing the column names, the user can quickly confirm whether the data structure matches their expectations before proceeding with further analysis. In the output above, the column names are descriptive text beginning with #, not measurement names, which signals that the header was not read as intended.

In [None]:
# Columns expected in each dataset; adjust the names if necessary
columns_to_select = ['BMXWT', 'BMXHT', 'BMXARML', 'BMXLEG', 'BMXARMC', 'BMXHIP', 'BMXWAIST']


This line is creating a list in Python. Think of a list as an ordered container that holds multiple items.

**In simpler terms:**

Imagine you have a spreadsheet with lots of columns, but you're only interested in a few specific ones. This line is like writing down the names of those specific columns you want to work with on a piece of paper. columns_to_select is the name you've given to that piece of paper, and the items in the list are the actual column names you've written down.

**Why is this important?**

Later in the code, this list (columns_to_select) will likely be used to select or extract only those specific columns from a larger dataset. This is a common way to focus your analysis on the columns you need.
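
To make the idea of selecting columns with a list concrete, here is a minimal sketch on a made-up table. The DataFrame df, its values, and the OTHER column are hypothetical, and the list is shortened to two names to keep the example small.

```python
import pandas as pd

# Hypothetical two-row table; the values are made up for illustration.
df = pd.DataFrame({
    "BMXWT": [80.0, 65.5],
    "BMXHT": [175.0, 160.2],
    "OTHER": ["a", "b"],  # a column we do not want to keep
})

# Shortened version of the list of column names to select.
columns_to_select = ["BMXWT", "BMXHT"]

# Indexing a DataFrame with a list of names returns only those columns,
# in the order given by the list.
subset = df[columns_to_select]

# to_numpy() converts the selection into a 2-D NumPy array (a matrix).
matrix = subset.to_numpy()

print(subset.columns.tolist())
print(matrix.shape)
```

If any name in the list is not a column of the DataFrame, this kind of indexing raises a KeyError, which is why checking the names first is useful.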

In [None]:
# Check if required columns exist
missing_male_cols = [col for col in columns_to_select if col not in male_data.columns]
missing_female_cols = [col for col in columns_to_select if col not in female_data.columns]

if missing_male_cols:
    print(f"Missing columns in Male Data: {missing_male_cols}")
if missing_female_cols:
    print(f"Missing columns in Female Data: {missing_female_cols}")



Missing columns in Male Data: ['BMXWT', 'BMXHT', 'BMXARML', 'BMXLEG', 'BMXARMC', 'BMXHIP', 'BMXWAIST']
Missing columns in Female Data: ['BMXWT', 'BMXHT', 'BMXARML', 'BMXLEG', 'BMXARMC', 'BMXHIP', 'BMXWAIST']


**In simpler terms:**

*Imagine you have a checklist (columns_to_select) of items you need. You then compare this checklist to the items you have in two boxes (your datasets, male_data and female_data). The code checks each box for the items on your checklist. If any item is missing from a box, it makes a note of it and tells you which items are missing from which box. This ensures that you have all the necessary data before you proceed with your analysis.*

In the output above, every item on the checklist is reported missing from both boxes, so the expected column names were not loaded.
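
The checklist comparison can also be wrapped in a small helper so that the same check is not written twice. This is only a sketch: find_missing_columns is a hypothetical name introduced here, and the one-row DataFrame is made up for the example.

```python
import pandas as pd


def find_missing_columns(df, required):
    """Return the names in `required` that are not columns of `df`."""
    return [col for col in required if col not in df.columns]


# Made-up table that contains two of the three required columns.
df = pd.DataFrame({"BMXWT": [80.0], "BMXHT": [175.0]})
required = ["BMXWT", "BMXHT", "BMXWAIST"]

missing = find_missing_columns(df, required)
if missing:
    print(f"Missing columns: {missing}")
```

An empty list means every required column is present, so the if block prints nothing in that case.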

In [None]:
# Display the first few rows of each dataset
print("Male Matrix (First 8 Rows):")
print(male_data[:8])

print("\nFemale Matrix (First 8 Rows):")
print(female_data[:8])


Male Matrix (First 8 Rows):
  # Body measurements of males >= 18 years old [cm]
0                                                 #
1                                     # Weight (kg)
2                            # Standing Height (cm)
3                           # Upper Arm Length (cm)
4                           # Upper Leg Length (cm)
5                          # Arm Circumference (cm)
6                          # Hip Circumference (cm)
7                        # Waist Circumference (cm)

Female Matrix (First 8 Rows):
  # Body measurements of females >= 18 years old [cm]   no missing data.
0                                                  #                 NaN
1                                      # Weight (kg)                 NaN
2                             # Standing Height (cm)                 NaN
3                            # Upper Arm Length (cm)                 NaN
4                            # Upper Leg Length (cm)                 NaN
5                           # Arm C

This code snippet is designed to show you the first 8 rows of both the male_data and female_data datasets. This is a common step in data analysis to get a quick glimpse of what the data looks like.

**In simpler terms:**

*You have two spreadsheets, one for male data and one for female data. This code is like taking the top 8 rows from each spreadsheet and printing them on the screen so you can see a preview of the data in each. This allows you to quickly understand the structure and content of your datasets before doing further analysis.*
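
The slice [:8] used above is one way to preview rows; the head method is a common alternative. The short sketch below, on a made-up single-column table, checks that both give the same result.

```python
import pandas as pd

# Made-up table with 20 rows, used only to demonstrate row previews.
df = pd.DataFrame({"value": range(20)})

# Slicing with [:8] keeps the first 8 rows by position.
first_rows = df[:8]

# head(8) returns the same first 8 rows.
same_rows = df.head(8)

print(len(first_rows))
print(first_rows.equals(same_rows))
```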

In [None]:
# Print the shapes to verify
print("Male Matrix Shape:", male_data.shape)
print("Female Matrix Shape:", female_data.shape)

Male Matrix Shape: (17, 1)
Female Matrix Shape: (17, 2)


**In simpler terms:**

You have two tables of data. These lines of code are essentially telling you the size of each table: how many rows and columns each table contains. This information is important for understanding the structure of your data and for performing further analysis.

*Example:*

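If the output of male_data.shape were (1234, 10), it would mean the male_data DataFrame has 1234 rows and 10 columns. Similarly, if female_data.shape were (1500, 10), the female_data DataFrame would have 1500 rows and 10 columns. In the actual output above, male_data has 17 rows and 1 column, and female_data has 17 rows and 2 columns.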
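
Since shape is a tuple of (rows, columns), it can be unpacked into two variables. The sketch below uses a small made-up table with placeholder column names a, b, c, and shows that the NumPy array behind the DataFrame reports the same shape.

```python
import numpy as np
import pandas as pd

# Made-up 4x3 table; the values 0..11 and the column names are placeholders.
data = np.arange(12).reshape(4, 3)
df = pd.DataFrame(data, columns=["a", "b", "c"])

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape

print(f"{n_rows} rows, {n_cols} columns")
print(df.to_numpy().shape == df.shape)
```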