<a href="https://colab.research.google.com/github/sondip702/Skill-Morph-assignment/blob/main/SkillMorph_4_NumPy_and_Pandas_for_Data_Science.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# 1. Introduction to NumPy and Pandas for Data Science
This section introduces the basics of NumPy and Pandas for data science.

## 1.1 NumPy Basics
Learn how to create arrays and perform basic array operations in NumPy.

## 1.2 Pandas Basics
Get familiar with creating DataFrames and performing basic operations.

# 2. Mathematical Operations in NumPy
Explore mathematical operations on NumPy arrays.

## 2.1 Element-wise Operations
Learn element-wise addition, multiplication, and other operations.

## 2.2 Broadcasting in NumPy
Understanding how broadcasting works in NumPy for array operations.

# 3. Statistical Operations in NumPy
Perform statistical operations like mean, standard deviation, etc.

# 4. Data Manipulation in Pandas
Learn how to manipulate data using Pandas DataFrames.

## 4.1 Creating DataFrames
Learn how to create and manipulate DataFrames.

## 4.2 DataFrame Operations
Perform basic operations like filtering, grouping, and sorting in Pandas.


In [None]:
# Install required packages (run this first in Google Colab)
!pip install numpy pandas matplotlib

# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
print("NumPy version:", np.__version__)
print("Pandas version:", pd.__version__)

NumPy version: 2.0.2
Pandas version: 2.2.2


In [None]:
# Creating NumPy arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])

print("1D Array:", arr1)
print("2D Array:\n", arr2)
print("Array1 shape:", arr1.shape)
print("Array1 dtype:", arr1.dtype)
print("Array2 shape:", arr2.shape)
print("Array2 dtype:", arr2.dtype)

1D Array: [1 2 3 4 5]
2D Array:
 [[1 2 3]
 [4 5 6]]
Array1 shape: (5,)
Array1 dtype: int64
Array2 shape: (2, 3)
Array2 dtype: int64


In [None]:
# Different ways to create arrays
zeros_array = np.zeros((3, 4))
ones_array = np.ones((2, 3))
range_array = np.arange(0, 10, 2)
linspace_array = np.linspace(0, 1, 4)

print("Zeros array:\n", zeros_array)
print("Ones array:\n", ones_array)
print("Range array:", range_array)
print("Linspace array:", linspace_array)

Zeros array:
 [[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
Ones array:
 [[1. 1. 1.]
 [1. 1. 1.]]
Range array: [0 2 4 6 8]
Linspace array: [0.         0.33333333 0.66666667 1.        ]


In [None]:
# Mathematical operations on arrays
a = np.array([1, 2, 3, 4, 5])
b = np.array([2, 3, 4, 5, 6])

# Element-wise operations
print("Addition:", a + b)
print("Multiplication:", a * b)
print("Power:", a ** 2)
print("Square root:", np.sqrt(a))


Addition: [ 3  5  7  9 11]
Multiplication: [ 2  6 12 20 30]
Power: [ 1  4  9 16 25]
Square root: [1.         1.41421356 1.73205081 2.         2.23606798]


In [None]:
# Statistical operations
print("Mean:", np.mean(a))
print("Standard deviation:", np.std(a))
print("Min:", np.min(a))
print("Max:", np.max(a))



Mean: 3.0
Standard deviation: 1.4142135623730951
Min: 1
Max: 5


In [None]:
# Broadcasting example
matrix = np.array([[1, 2, 3], [4, 5, 6]])
vector = np.array([10, 20, 30])
result = matrix + vector
print("Broadcasting result:\n", result)

In [None]:
arr = np.array([10, 20, 30, 40, 50])

# Indexing
print("Element at index 2:", arr[2])

# Slicing
print("Sliced array (from index 1 to 3):", arr[1:4])

# 2D array slicing
arr_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Sliced 2D array (first two rows):\n", arr_2d[:2])

Element at index 2: 30
Sliced array (from index 1 to 3): [20 30 40]
Sliced 2D array (first two rows):
 [[1 2 3]
 [4 5 6]]


In [None]:
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [24, 27, 22],
        'City': ['New York', 'Los Angeles', 'Chicago']}

df = pd.DataFrame(data)
print("DataFrame:\n", df)

DataFrame:
       Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago


In [None]:
# Basic DataFrame information
print("\nDataFrame Info:")
print(df.info())
print("\nDataFrame Description:")
print(df.describe())
print("\nDataFrame Shape:", df.shape)
print("Column names:", df.columns.tolist())
# Display first 5 rows
print("First 5 rows of the dataset:")
print(df.head())
# Display last 5 rows
print("\nLast 5 rows of the dataset:")
print(df.tail())
# Check data types
print("\nData Types:")
print(df.dtypes)


DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    3 non-null      object
 1   Age     3 non-null      int64 
 2   City    3 non-null      object
dtypes: int64(1), object(2)
memory usage: 204.0+ bytes
None

DataFrame Description:
             Age
count   3.000000
mean   24.333333
std     2.516611
min    22.000000
25%    23.000000
50%    24.000000
75%    25.500000
max    27.000000

DataFrame Shape: (3, 3)
Column names: ['Name', 'Age', 'City']
First 5 rows of the dataset:
      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago

Last 5 rows of the dataset:
      Name  Age         City
0    Alice   24     New York
1      Bob   27  Los Angeles
2  Charlie   22      Chicago

Data Types:
Name    object
Age      int64
City    object
dtype: object


In [None]:
# Accessing a single column
print("Names column:\n", df['Name'])

# Accessing multiple columns
print("\nName and Age columns:\n", df[['Name', 'Age']])


Names column:
 0      Alice
1        Bob
2    Charlie
Name: Name, dtype: object

Name and Age columns:
       Name  Age
0    Alice   24
1      Bob   27
2  Charlie   22


In [None]:
# Convert NumPy array to DataFrame
np_array = np.random.randn(5, 3)
df_from_array = pd.DataFrame(np_array, columns=['A', 'B', 'C'])
print("DataFrame from NumPy array:")
print(df_from_array)

DataFrame from NumPy array:
          A         B         C
0 -0.448212  0.005379  0.216114
1 -1.840020  1.762773 -0.356492
2 -0.621479  0.242264  0.588236
3 -0.879346  0.348669  0.914871
4  0.245984  0.238310 -2.124657


In [None]:
# Create sample sales data
sales_data = {
    'Region': ['North', 'South', 'East', 'West', 'North', 'South', 'East', 'West'],
    'Product': ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 150, 200, 120, 110, 180, 190, 140],
    'Quantity': [10, 15, 20, 12, 11, 18, 19, 14]
}
sales_df = pd.DataFrame(sales_data)
print("Sales DataFrame:")
print(sales_df)

# Group by operations
region_sales = sales_df.groupby('Region')['Sales'].sum()
print("\nSales by Region:")
print(region_sales)


Sales DataFrame:
  Region Product  Sales  Quantity
0  North       A    100        10
1  South       A    150        15
2   East       B    200        20
3   West       B    120        12
4  North       A    110        11
5  South       B    180        18
6   East       A    190        19
7   West       B    140        14

Sales by Region:
Region
East     390
North    210
South    330
West     260
Name: Sales, dtype: int64


In [6]:
from google.colab import drive
import warnings
warnings.filterwarnings('ignore')

In [7]:
drive.mount('/content/drive')
file_path = '/content/drive/MyDrive/datasets/diabetes.csv'
columns = ['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome']
df = pd.read_csv(file_path, names=columns)  # No header in raw CSV
print("Dataset loaded successfully. Shape:", df.shape)

Mounted at /content/drive


FileNotFoundError: [Errno 2] No such file or directory: '/content/drive/MyDrive/datasets/diabetes.csv'

In [None]:
from google.colab import files
uploaded = files.upload()

# Load the uploaded file
import io
df = pd.read_csv(io.BytesIO(uploaded['diabetesversion1.csv']))

KeyError: 'diabetesversion1.csv'

In [None]:
data_array = df.values
print("NumPy array shape:", data_array.shape)
print("Data type:", data_array.dtype)


NumPy array shape: (769, 9)
Data type: object


In [None]:
# Quick overview
print(df.head())
print("\nDataset info:")
print(df.info())

# Basic statistics
print("\nDescriptive statistics:")
print(df.describe())

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0  Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI   
1            6      148             72             35        0  33.6   
2            1       85             66             29        0  26.6   
3            8      183             64              0        0  23.3   
4            1       89             66             23       94  28.1   

   DiabetesPedigreeFunction  Age  Outcome  
0  DiabetesPedigreeFunction  Age  Outcome  
1                     0.627   50        1  
2                     0.351   31        0  
3                     0.672   32        1  
4                     0.167   21        0  

Dataset info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 769 entries, 0 to 768
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   Pregnancies               769 non-null    object
 1   Glucose          

In [8]:
#missing values
print("Missing values in each column:")
print(df.isnull().sum())

Missing values in each column:


NameError: name 'df' is not defined

In [None]:
# Import necessary library
import numpy as np

# Create a NumPy array with 20 random integers between 1 and 100


# Display the NumPy array


# 1. Find the sum of all elements
# The np.sum() function computes the sum of all elements in the array.


# 2. Find the maximum and minimum values in the array
# np.max() returns the maximum value and np.min() returns the minimum value in the array.



In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np

# 1. Create a DataFrame with 5 rows and 3 columns of random numbers
# The DataFrame is generated from a 5x3 matrix of random integers between 1 and 100.


# Display the DataFrame


# 2. Compute the sum of each column
# The .sum() function computes the sum for each column in the DataFrame.


# 3. Compute the mean of each column
# The .mean() function computes the mean (average) of each column in the DataFrame.

