# <center> R - Creating and Analyzing Data Frames </center>
### What are Data frames?
- A data frame is R's standard structure for storing tabular data, much as a table does in a database
- Data frames resemble matrices with `named` rows and columns, but each column can hold a different data type
- Data frames can be imported from and exported to `CSV files`
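
As a minimal sketch of the points above, the snippet below builds a data frame with mixed column types and sends it through a CSV file and back. The names `people` and `csv_path` and the use of a temporary file are illustrative assumptions, not part of the original notes.

```r
# Illustrative data frame: one integer column and one character column
people <- data.frame(
  id   = 1:3,
  name = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# Unlike a matrix, each column keeps its own type
sapply(people, class)

# Write to a temporary CSV file, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(people, csv_path, row.names = FALSE)
people_again <- read.csv(csv_path, stringsAsFactors = FALSE)

print(people_again)
```

Passing `row.names = FALSE` keeps the automatic row numbers out of the file, so they do not come back as an extra column when the file is read again.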

Both pandas in Python and R's data frames are powerful and widely used for data manipulation and analysis. The choice between them often depends on your personal preferences, project requirements, and existing knowledge. Here are some factors to consider when comparing the two:

**Pandas (Python):**

1. **General-Purpose Language:** Python is a general-purpose programming language, which means you can use it for a wide range of tasks beyond data analysis. This makes it a versatile choice for many projects.

2. **Ecosystem:** Python has a large and active ecosystem with a rich set of libraries and tools for various purposes, including machine learning, web development, and scientific computing.

3. **Pandas:** Pandas is a powerful library for data manipulation and analysis. It provides extensive functionality for cleaning, transforming, and analyzing tabular data.

4. **Integration:** Python seamlessly integrates with other data science and machine learning libraries such as NumPy, scikit-learn, and TensorFlow, making it a popular choice for end-to-end data science projects.

5. **Community and Resources:** Python has a large and active community, which means you can find plenty of tutorials, documentation, and support online.

**R Data Frames (R):**

1. **Specialized Language:** R is a specialized language for statistical computing and data analysis. It excels in statistical modeling and has a rich set of statistical packages.

2. **Data Analysis Focus:** R was designed with a strong focus on data analysis, which means it often offers more specialized and domain-specific tools for data manipulation and visualization.

3. **ggplot2:** R's ggplot2 package is renowned for its powerful and flexible data visualization capabilities.

4. **Community:** R has a dedicated and active community of statisticians and data analysts, making it a preferred choice for many data-centric tasks.

**Which One Is More Powerful?**

The concept of "power" can be subjective and context-dependent. Both Python with pandas and R offer powerful tools for data analysis, and each has its strengths in different areas. The choice often depends on your specific use case and personal preferences.

- If you need to integrate data analysis into a broader software project, `Python might be a better choice due to its versatility`.
- If your primary focus is on statistical analysis, data visualization, and data exploration, `R might be more suitable`.

Many data scientists and analysts are proficient in both Python and R and choose the tool that best fits the task at hand. Ultimately, your familiarity with the language and the specific requirements of your project will play a significant role in determining which one is more suitable for your needs.

In [1]:
# Let's create a simple data frame:

n = c(1:3)
s = c("a","b","c")

# Recall: data.frame(col1, col2, ..., colN) builds a data frame; each argument becomes a column

df = data.frame(n,s)
print(df)

  n s
1 1 a
2 2 b
3 3 c


In [2]:
# Recall: colnames() returns the column names as a character vector
headers = colnames(df)
headers

# Rename ALL column names in the dataframe
colnames(df) = c('NN','SS')
new_headers = colnames(df)
new_headers

# Rename only a specific column in the dataframe (index the names vector, not the dataframe)
# colnames(df)[index#] = 'new_name'

In [3]:
# Recall: rownames() returns the row names as a character vector
rows = rownames(df)
rows

# Rename ALL rows in the dataframe
rownames(df) = c('rowA','rowB','rowC')
new_rows = rownames(df)
new_rows

# Rename only a specific row in the dataframe (index the names vector, not the dataframe)
# rownames(df)[index#] = 'new_name'
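
To rename a single row or column, the assignment has to target one element of the names vector. The sketch below rebuilds the small data frame from the earlier cells; the replacement names are illustrative assumptions.

```r
# Rebuild the small data frame from the cells above
df <- data.frame(n = 1:3, s = c("a", "b", "c"))

# Rename only the second column, by position
colnames(df)[2] <- "letters_col"

# Rename a column by its current name instead of its position
colnames(df)[colnames(df) == "n"] <- "number"

# Rename only the first row
rownames(df)[1] <- "first"

print(df)
```

Renaming by current name is usually safer than renaming by position, because it keeps working if the column order changes.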

In [4]:
# Let's look at what our dataframe looks like now
print(df)

     NN SS
rowA  1  a
rowB  2  b
rowC  3  c


#### Now we can see that the dataframe has been edited to our liking

#### Calling entries in the dataframe

In [5]:
# To call a specific cell/entry in our dataframe

rowA = df['rowA',]
print(rowA)
class(rowA)

rowA_colN = df['rowA','NN']
print(rowA_colN)
class(rowA_colN)

# Alternative way of calling the row+col value:
print(df[1,1])  # Easy if we know the exact coordinates

     NN SS
rowA  1  a


[1] 1


[1] 1
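
Beyond `df[row, col]`, a few other indexing forms are common. The following is a small sketch that recreates the renamed data frame from above; it assumes nothing beyond base R.

```r
# Recreate the data frame with the row and column names used above
df <- data.frame(NN = 1:3, SS = c("a", "b", "c"),
                 row.names = c("rowA", "rowB", "rowC"))

# A whole column as a plain vector, three equivalent ways
df$NN
df[["NN"]]
df[, "NN"]

# The same column kept as a one-column data frame
df[, "NN", drop = FALSE]

# Several rows and columns at once, selected by name
df[c("rowA", "rowC"), c("NN", "SS")]
```

Selecting a single column returns a vector by default; `drop = FALSE` is the way to keep the data frame structure when later code expects one.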


# <font color=blue> More context about the summary() function for analyzing data frames (ChatGPT)</font>

In R, the `summary()` function is used to generate a summary of the contents and statistical properties of a data frame or a matrix. It provides a concise overview of the dataset, including measures of central tendency, spread, and basic categorical information. The specific information provided by `summary()` varies depending on the data type of the variables within the data frame. Here's what `summary()` typically includes for different variable types:

1. **Numeric Variables:**
   - For numeric variables (e.g., continuous or integer), `summary()` typically provides:
     - Minimum value
     - 1st Quartile (25th percentile)
     - Median (2nd Quartile or 50th percentile)
     - Mean (average)
     - 3rd Quartile (75th percentile)
     - Maximum value

2. **Factor (Categorical) Variables:**
   - For factor variables (e.g., categorical variables), `summary()` shows the count of each category (level) within the variable.

3. **Character Variables:**
   - For character variables (e.g., strings), `summary()` reports only the length, class, and mode of the column. It does not count values; convert the column to a factor to get per-value counts.

4. **Logical Variables:**
   - For logical variables (e.g., Boolean), `summary()` provides the count of `TRUE` and `FALSE` values.
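
The per-type behaviour listed above can be checked by calling `summary()` on one vector of each type. This is a minimal sketch using made-up values:

```r
# Numeric: minimum, quartiles, median, mean, maximum
summary(c(28, 35, 22, 29))

# Factor: one count per level
summary(factor(c("Female", "Male", "Male", "Male")))

# Character: length, class, and mode only
summary(c("Alice", "Bob", "Charlie", "David"))

# Logical: mode plus counts of FALSE and TRUE
summary(c(TRUE, FALSE, TRUE, FALSE))
```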

#### Here's an example of how to use `summary()` on a data frame in R:

In [6]:
# Create a sample data frame
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age = c(28, 35, 22, 29),
  Gender = factor(c("Female", "Male", "Male", "Male")),
  Married = c(TRUE, FALSE, TRUE, FALSE)
)
df

Name,Age,Gender,Married
Alice,28,Female,True
Bob,35,Male,False
Charlie,22,Male,True
David,29,Male,False


In [7]:
# Generate a summary of the data frame
summary(df)

      Name        Age          Gender   Married       
 Alice  :1   Min.   :22.0   Female:1   Mode :logical  
 Bob    :1   1st Qu.:26.5   Male  :3   FALSE:2        
 Charlie:1   Median :28.5              TRUE :2        
 David  :1   Mean   :28.5                             
             3rd Qu.:30.5                             
             Max.   :35.0                             

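The output of `summary(df)` summarizes each column: the minimum, 1st quartile, median, mean, 3rd quartile, and maximum for numeric variables, and counts for factor and logical variables. `Name` appears with per-value counts here because R versions before 4.0 convert strings to factors by default in `data.frame()`; in R 4.0 and later, a character column is reported with its length, class, and mode instead. Either way, `summary()` is a quick way to get an initial understanding of the distribution of data in each column of your data frame.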
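
Individual statistics can also be pulled out of a summary by name, or computed directly. The sketch below reuses the `Age` values from the example; the variable name `age_summary` is an illustrative assumption.

```r
# Same ages as in the sample data frame above
df <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David"),
  Age  = c(28, 35, 22, 29),
  stringsAsFactors = FALSE
)

# Summarize a single column and extract one statistic by name
age_summary <- summary(df$Age)
age_summary[["Median"]]

# The same numbers are available from dedicated functions
mean(df$Age)
quantile(df$Age, probs = c(0.25, 0.75))
```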

# Let's analyze one of R's built-in example data frames

In [8]:
mtcars  # show the entire mtcars data frame (32 rows; only the first 10 are shown below)

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160.0,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160.0,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108.0,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258.0,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360.0,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225.0,105,2.76,3.46,20.22,1,0,3,1
Duster 360,14.3,8,360.0,245,3.21,3.57,15.84,0,0,3,4
Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4


In [9]:
head(mtcars) # Show the first few lines of data in mtcars
tail(mtcars) # Show the last few lines of data in mtcars

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21.0,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21.0,6,160,110,3.9,2.875,17.02,0,1,4,4
Datsun 710,22.8,4,108,93,3.85,2.32,18.61,1,1,4,1
Hornet 4 Drive,21.4,6,258,110,3.08,3.215,19.44,1,0,3,1
Hornet Sportabout,18.7,8,360,175,3.15,3.44,17.02,0,0,3,2
Valiant,18.1,6,225,105,2.76,3.46,20.22,1,0,3,1


Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
Ford Pantera L,15.8,8,351.0,264,4.22,3.17,14.5,0,1,5,4
Ferrari Dino,19.7,6,145.0,175,3.62,2.77,15.5,0,1,5,6
Maserati Bora,15.0,8,301.0,335,3.54,3.57,14.6,0,1,5,8
Volvo 142E,21.4,4,121.0,109,4.11,2.78,18.6,1,1,4,2
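
Both `head()` and `tail()` accept an `n` argument for the number of rows to show, and `dim()` and `str()` give a quick structural overview. A short sketch on the same built-in data set:

```r
# Number of rows and columns
dim(mtcars)

# First three and last two rows
head(mtcars, n = 3)
tail(mtcars, n = 2)

# Compact overview of every column's type and first values
str(mtcars)
```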


In [10]:
header_names = colnames(mtcars)  # returns all column names in the mtcars dataframe as a character vector
header_names

In [11]:
# Let's get the summary of the mtcars dataframe
summary(mtcars)

      mpg             cyl             disp             hp       
 Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0  
 1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5  
 Median :19.20   Median :6.000   Median :196.3   Median :123.0  
 Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7  
 3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0  
 Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0  
      drat             wt             qsec             vs        
 Min.   :2.760   Min.   :1.513   Min.   :14.50   Min.   :0.0000  
 1st Qu.:3.080   1st Qu.:2.581   1st Qu.:16.89   1st Qu.:0.0000  
 Median :3.695   Median :3.325   Median :17.71   Median :0.0000  
 Mean   :3.597   Mean   :3.217   Mean   :17.85   Mean   :0.4375  
 3rd Qu.:3.920   3rd Qu.:3.610   3rd Qu.:18.90   3rd Qu.:1.0000  
 Max.   :4.930   Max.   :5.424   Max.   :22.90   Max.   :1.0000  
       am              gear            carb      
 Min.   :0.0000   Min.   :3.000  

In [12]:
# Select row indices that have the pattern 'Merc' in their names

mrc = grep("Merc",rownames(mtcars))  # grep() returns an integer vector of the matching row indices
mrc

# Now let's use these indices to select the matching rows
mtcars[mrc, ]

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Merc 240D,24.4,4,146.7,62,3.69,3.19,20.0,1,0,4,2
Merc 230,22.8,4,140.8,95,3.92,3.15,22.9,1,0,4,2
Merc 280,19.2,6,167.6,123,3.92,3.44,18.3,1,0,4,4
Merc 280C,17.8,6,167.6,123,3.92,3.44,18.9,1,0,4,4
Merc 450SE,16.4,8,275.8,180,3.07,4.07,17.4,0,0,3,3
Merc 450SL,17.3,8,275.8,180,3.07,3.73,17.6,0,0,3,3
Merc 450SLC,15.2,8,275.8,180,3.07,3.78,18.0,0,0,3,3
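
An alternative to index-based selection is a logical mask from `grepl()`, which is convenient for counting matches. The sketch below anchors the pattern with `^` so that only names starting with "Merc" match; the variable name `is_merc` is an illustrative assumption.

```r
# TRUE for every row whose name starts with "Merc"
is_merc <- grepl("^Merc", rownames(mtcars))

# Count the matches
sum(is_merc)

# Select the matching rows and a few columns of interest
mtcars[is_merc, c("mpg", "cyl", "hp")]

# Return the matching names themselves instead of their positions
grep("^Merc", rownames(mtcars), value = TRUE)
```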


In [14]:
# From the Quiz:
# Question 2: How many cars have the minimum mileage of 10.40?

mn = min(mtcars[,'mpg'])
mn_ind = which(mtcars[,'mpg'] == mn)  # which() returns the row indices where the condition is TRUE
mtcars[mn_ind,]

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Cadillac Fleetwood,10.4,8,472,205,2.93,5.25,17.98,0,0,3,4
Lincoln Continental,10.4,8,460,215,3.0,5.424,17.82,0,0,3,4


**Answer: 2 cars only**
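
The count can also be obtained directly, without printing the rows. This is a minimal sketch of one way to do it, with `min_mpg` as an illustrative variable name:

```r
# Lowest mileage in the data set
min_mpg <- min(mtcars$mpg)

# Summing a logical vector counts its TRUE values
sum(mtcars$mpg == min_mpg)

# Names of the cars that share the minimum
rownames(mtcars)[mtcars$mpg == min_mpg]

# which.min() returns only the FIRST position of the minimum,
# so it would miss ties like the one here
which.min(mtcars$mpg)
```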

In [1]:
# From the Quiz:
# Question 3: How many Fiats are in mtcars?

# We need to find all cars that have the pattern "Fiat" in their row name
car_names = rownames(mtcars)
car_names

fiat = grep("Fiat", car_names)  # Returns only the matching row indices, which is all we need
mtcars[fiat,]  # Show which cars those are

how_many_cars_withName_Fiat = length(fiat)  # Count the matching indices

# Print the result as a complete sentence
cat("There is/are", how_many_cars_withName_Fiat, 'count(s) of Fiat car(s) in the dataframe', '\n')


Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Fiat 128,32.4,4,78.7,66,4.08,2.2,19.47,1,1,4,1
Fiat X1-9,27.3,4,79.0,66,4.08,1.935,18.9,1,1,4,1


There is/are 2 count(s) of Fiat car(s) in the dataframe 
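
The same counting pattern can be wrapped in a small helper so it works for any make. The function `count_cars_matching` below is a hypothetical name introduced for illustration; it is not part of base R or of the original notes.

```r
# Hypothetical helper: count rows whose name contains a literal string
count_cars_matching <- function(pattern, data = mtcars) {
  # fixed = TRUE treats the pattern as plain text, not a regular expression
  sum(grepl(pattern, rownames(data), fixed = TRUE))
}

n_fiat <- count_cars_matching("Fiat")
cat(sprintf("There are %d car(s) whose name contains 'Fiat'.\n", n_fiat))

# The helper works for other makes as well
count_cars_matching("Merc")
count_cars_matching("Toyota")
```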
